Data Engineering Roadmap

Data Engineering Roadmap

Data Engineering Roadmap

Six months. Seven stages. From SQL basics to running data systems at scale.

A real, ordered learning path for becoming a senior data engineer. Each stage builds on the last one. You finish each stage by shipping something small, not by collecting certificates.

How to read this page

Read top to bottom. Do the stages in order. The order matters more than people admit. If you try to learn Kafka before you can read a query plan, you will paste tutorials together and hope. Most people who get stuck in data engineering get stuck because they skipped the boring basics: SQL, modeling, and what a transaction actually is.

Two paces:

  • If you already write code at work, plan on four months.
  • If you are still learning to code or come from analytics, plan on eight months.

Either way, the structure is the same.


The journey, in one picture

flowchart LR
    S1[1. SQL fundamentals<br/>Month 1]:::a
    S2[2. Modeling and warehousing<br/>Month 2]:::a
    S3[3. Batch and orchestration<br/>Month 3]:::b
    S4[4. Streaming<br/>Month 4]:::b
    S5[5. Storage and formats<br/>Month 5]:::c
    S6[6. Reliability and cost<br/>Month 6]:::c
    S7[7. Interview craft<br/>parallel from week 1]:::d

    S1 --> S2 --> S3 --> S4 --> S5 --> S6
    S7 -.- S6

    classDef a fill:#dbeafe,stroke:#1e40af,color:#1e3a8a
    classDef b fill:#dcfce7,stroke:#15803d,color:#14532d
    classDef c fill:#fed7aa,stroke:#c2410c,color:#7c2d12
    classDef d fill:#fef3c7,stroke:#a16207,color:#713f12

Stages 1 to 6 are sequential. Stage 7 runs alongside the whole thing.


What you can do at each level

A quick honesty check. Where are you now, where do you want to be?

LevelWhat you can doWhat people pay you for
JuniorWrite a SQL query. Load a CSV into a warehouse. Run a notebook.Filling in pieces of someone else’s pipeline.
MidBuild a batch pipeline from source to dashboard. Read a query plan. Backfill safely.Owning a pipeline end to end and keeping it healthy.
SeniorChoose between batch and streaming. Design a warehouse model. Spot the cost bomb before it ships.Designing pipelines other engineers will build.
StaffSet platform direction. Predict the failure mode you have never seen. Have an opinion on every trade-off.Setting direction across teams. Catching the failure nobody else sees.

By the end of Stage 4 you are mid-level. By the end of Stage 6 you are senior. Staff comes from production scars, not from a roadmap.


Stage 1: SQL fundamentals

Goal. Be fluent in SQL. Not “I can write a SELECT”, but “I can read a 200-line query someone else wrote and tell you what it does and where it is slow.”

The picture in your head.

flowchart LR
    Q["Your query"]:::q --> P["Planner"]:::p
    P -->|"chooses scan + joins"| E["Execution plan"]:::p
    E --> R[("Result")]:::r

    classDef q fill:#dbeafe,stroke:#1e40af,color:#1e3a8a
    classDef p fill:#dcfce7,stroke:#15803d,color:#14532d
    classDef r fill:#fed7aa,stroke:#c2410c,color:#7c2d12

Topics.

GroupTopics
Selecting and filteringSELECT, WHERE, ORDER BY, LIMIT, DISTINCT, IS NULL semantics.
JoinsINNER, LEFT, RIGHT, FULL OUTER, CROSS, anti-joins. When each one fits.
AggregationGROUP BY, HAVING, COUNT, SUM, AVG, GROUPING SETS, ROLLUP, CUBE.
Window functionsOVER, PARTITION BY, ROW_NUMBER, RANK, LAG, LEAD, running totals.
CTEs and subqueriesWITH, correlated vs uncorrelated, why CTEs help readability.
Reading a planEXPLAIN, EXPLAIN ANALYZE, seq scan vs index scan, hash join vs nested loop.
Set theory and datesUNION vs UNION ALL, INTERSECT, EXCEPT, date arithmetic, time zones.

Build this in week 4. Find a small public dataset (NYC taxi, Stack Overflow dump, anything 1 to 5 GB). Load it into Postgres or DuckDB. Write 10 queries that answer real questions about it. Run EXPLAIN on each. Add one index that makes a slow query 10x faster.

You are done when you can look at any reasonable SQL query and predict roughly how long it will take to run on a given table size, before you press execute.


Stage 2: Data modeling and warehousing

Goal. Decide what tables to build. Models age slower than code. A bad model will haunt the company for years.

The picture in your head.

flowchart LR
    OLTP[("App DB<br/>OLTP, row store")]:::a -->|"daily load"| STG[("Staging")]:::s
    STG -->|"transform"| FACT[("Fact: orders")]:::f
    STG -->|"transform"| DIM[("Dim: customers")]:::d
    FACT -.->|"FK"| DIM

    classDef a fill:#dbeafe,stroke:#1e40af,color:#1e3a8a
    classDef s fill:#fef3c7,stroke:#a16207,color:#713f12
    classDef f fill:#fed7aa,stroke:#c2410c,color:#7c2d12
    classDef d fill:#dcfce7,stroke:#15803d,color:#14532d

Topics.

GroupTopics
OLTP vs OLAPRow stores vs column stores. Why the same table is laid out differently in each.
Normalisation1NF, 2NF, 3NF. When to denormalise on purpose.
Dimensional modelingFacts, dimensions, surrogate keys, conformed dimensions.
Slowly changing dimensionsSCD Type 1, Type 2, Type 3. Which one your stakeholders actually need.
Schemas in the warehouseStar, snowflake, data vault. Trade-offs between them.
GrainWhat one row in this fact table means. The most underrated question in modeling.
NamingConsistent prefixes, plurals, tenses. Future you will thank present you.

Build this in week 8. Take your Stage 1 dataset. Design a small star schema for it: one fact table, two or three dimensions. Add a Type 2 SCD for one dimension. Write the load query that handles a row update without breaking history.

You are done when someone asks “what is the grain of this table?” and you answer in one sentence without checking.


Stage 3: Batch pipelines and orchestration

Goal. Move data from A to B, on a schedule, without waking up at 3am. This is the bread-and-butter of the job.

The picture in your head.

flowchart LR
    SRC[("Source: API / DB / files")]:::src -->|"extract"| LAND[("Landing zone")]:::stg
    LAND -->|"transform"| WH[("Warehouse")]:::wh
    WH -->|"serve"| BI[("BI / ML")]:::out

    SCH(["Scheduler"]):::sch -.->|"triggers"| LAND
    SCH -.->|"triggers"| WH

    classDef src fill:#dbeafe,stroke:#1e40af,color:#1e3a8a
    classDef stg fill:#fef3c7,stroke:#a16207,color:#713f12
    classDef wh fill:#fed7aa,stroke:#c2410c,color:#7c2d12
    classDef out fill:#dcfce7,stroke:#15803d,color:#14532d
    classDef sch fill:#e9d5ff,stroke:#7e22ce,color:#581c87

Topics.

GroupTopics
ETL vs ELTWhy ELT won for warehousing. When ETL still makes sense.
IdempotencyWhat it really means. Why every task should be safe to re-run.
Incremental loadsWatermarks, CDC, merge vs full refresh. Picking the right one.
OrchestrationDAGs, dependencies, sensors, backfills, retries with backoff.
ToolsAirflow, Dagster, Prefect, dbt. What each one is good at.
Testing dataSchema tests, freshness tests, uniqueness tests. Where dbt tests fit.
BackfillsHow to backfill 6 months without melting the warehouse or blowing the budget.

Build this in week 12. Wrap your Stage 2 work in dbt or Airflow. Add: a daily refresh, one incremental model with a watermark, three data tests, and a backfill command. Run it on a free Airflow image or in dbt Cloud.

You are done when you can take any source-to-dashboard pipeline request, sketch the DAG on a napkin in 5 minutes, and confidently estimate it in days, not weeks.


Stage 4: Streaming and event-driven

Goal. Handle data that arrives one event at a time, often out of order, often late, sometimes twice. The hardest mental shift in data engineering.

The picture in your head.

flowchart LR
    P["Producers"]:::p -->|"events"| K[("Kafka topic")]:::k
    K -->|"consumes"| C["Stream processor"]:::c
    C -->|"writes"| SINK[("Sink: warehouse / index")]:::s

    classDef p fill:#dbeafe,stroke:#1e40af,color:#1e3a8a
    classDef k fill:#fed7aa,stroke:#c2410c,color:#7c2d12
    classDef c fill:#dcfce7,stroke:#15803d,color:#14532d
    classDef s fill:#fef3c7,stroke:#a16207,color:#713f12

Topics.

GroupTopics
Kafka basicsTopics, partitions, offsets, consumer groups, retention.
Delivery semanticsAt-most-once, at-least-once, exactly-once. What each actually guarantees.
TimeEvent time vs processing time. Why this distinction matters more than it sounds.
WatermarksWhat they are. Why late data is the default, not the exception.
Schema evolutionAvro, Protobuf, schema registry, backward and forward compatibility.
Stateful streamingJoins, windows (tumbling, sliding, session), state stores.
ToolsKafka Streams, Flink, Spark Structured Streaming, ksqlDB. What each is for.

Build this in week 16. Stand up local Kafka (Docker is fine). Produce events with a script. Consume them with Kafka Streams or Flink. Compute a tumbling 1-minute count. Send one event late, see what happens. Add a schema registry and break compatibility on purpose.

You are done when someone says “let us add streaming” and your first three questions are about ordering, late events, and what happens when a consumer falls behind, not about which tool to pick.


Stage 5: Storage and file formats

Goal. Decide where bytes live and in what shape. Cheap to ignore until your bill or your latency tells you otherwise.

The picture in your head.

flowchart LR
    RAW[("Raw: JSON / CSV<br/>cold tier")]:::r -->|"convert"| COL[("Columnar: Parquet<br/>warm tier")]:::c
    COL -->|"transactional layer"| LH[("Iceberg / Delta")]:::l
    LH -->|"query"| Q["Trino / Spark / DuckDB"]:::q

    classDef r fill:#dbeafe,stroke:#1e40af,color:#1e3a8a
    classDef c fill:#fed7aa,stroke:#c2410c,color:#7c2d12
    classDef l fill:#dcfce7,stroke:#15803d,color:#14532d
    classDef q fill:#fef3c7,stroke:#a16207,color:#713f12

Topics.

GroupTopics
File formatsCSV, JSON, Avro, Parquet, ORC. When each one wins.
Columnar storageWhy Parquet is the right default for analytics. Encoding, compression, predicate pushdown.
PartitioningHow to partition. How to not partition. Small file problem.
Clustering and Z-orderingWhen partitioning is not enough.
Object storageS3, GCS, ADLS. Consistency model. Cost of LIST.
LakehouseIceberg, Delta Lake, Hudi. ACID on top of files, table format, time travel.
Hot vs coldTiers. Lifecycle policies. Glacier and what it actually costs to read back.

Build this in week 20. Take your Stage 3 output. Save it as Parquet to local disk or S3 (Minio is fine locally). Partition by date. Query it with DuckDB. Convert one table to Iceberg or Delta. Do a time-travel query. Read the small file count and fix it.

You are done when a teammate asks “should we keep this in Parquet or in the warehouse?” and your answer is a cost and a query pattern, not a vibe.


Stage 6: Reliability, debugging, and cost

Goal. Run pipelines the way you would run a service. This is what separates a senior data engineer from a mid one.

The picture in your head.

flowchart LR
    PIPE["Pipeline"]:::p --> METRICS[("Metrics + logs")]:::m
    PIPE --> LINE[("Lineage")]:::l
    PIPE --> DQ[("Data quality")]:::q
    METRICS --> AL(["Alerts"]):::a
    DQ --> AL

    classDef p fill:#dcfce7,stroke:#15803d,color:#14532d
    classDef m fill:#dbeafe,stroke:#1e40af,color:#1e3a8a
    classDef l fill:#fed7aa,stroke:#c2410c,color:#7c2d12
    classDef q fill:#fef3c7,stroke:#a16207,color:#713f12
    classDef a fill:#fecaca,stroke:#b91c1c,color:#7f1d1d

Topics.

GroupTopics
ObservabilityLogs, metrics, traces. SLAs, SLOs, error budgets for data.
LineageColumn-level lineage. Why dashboards break and how to find out before users do.
Data qualityFreshness, completeness, uniqueness, drift. Great Expectations, dbt tests, custom checks.
DebuggingSlow queries, skewed joins, OOM in Spark, Airflow stuck in queued. How to find the cause fast.
CapacityEstimating warehouse credits or cluster size before someone else estimates them for you.
CostReading a query bill. Killing the top three offenders. Warehouse auto-scaling vs fixed.
Incident responsePostmortems that change something. Blameless writing. The one-pager template.

Build this in week 24. Add monitoring to your Stage 3 pipeline. Pick three things you would page yourself for and three you would not. Wire them up. Trigger a fake incident: change a source schema. Write a one-page postmortem.

You are done when you can take a vague “the dashboard is wrong” complaint and trace it to the exact query, table, or upstream change in under an hour.


Stage 7: Interview craft (parallel from week 1)

Goal. Be the person who can also explain it on a call, on a whiteboard, in 45 minutes.

This stage runs alongside Stages 1 to 6. Spend one hour a week on it from day one.

Topics.

GroupTopics
DE system designDesigning a metrics pipeline, a CDC pipeline, an event collection system.
Trade-off framing“Here are three options, here is what each one costs, here is what I would pick and why.”
EstimationRows per day. Bytes per row. Storage cost per month. Doing this in your head.
Narrating debuggingWalking through a real incident out loud, with the wrong turns.
Behaviour interviewsSTAR stories about ownership, conflict, and the worst day you had at work.

Build this every week. One mock interview a week. Record yourself answering one DE design question, then watch it back the next day. The first three times will hurt; that is the point.

You are done when you can do one DE design question end to end, out loud, in 40 minutes, without a panic spike when the interviewer says “scale this to 10x.”


The full topic matrix

StageCore topicsTools you should touch
1SQL fluency, query plansPostgres or DuckDB
2Modeling, SCDs, grainAny warehouse, dbt (light)
3Orchestration, idempotency, testsAirflow or Dagster, dbt
4Streaming, watermarks, schema registryKafka, Flink or Kafka Streams
5Parquet, partitioning, lakehouseS3 or Minio, Iceberg or Delta, DuckDB
6Lineage, DQ, cost, incidentsAny observability stack, OpenLineage
7Design, narration, estimationWhiteboard, webcam, a friend

The 6-month plan, week by week

MonthWeeksFocusBuild
11 to 4SQL fundamentalsQuerying and indexing a small dataset
25 to 8Modeling and warehousingStar schema with a Type 2 SCD
39 to 12Batch and orchestrationdbt or Airflow pipeline with tests and backfill
413 to 16StreamingKafka + stream processor with windows
517 to 20Storage and formatsParquet on object storage + lakehouse table
621 to 24Reliability and costMonitoring, alerts, one postmortem
All1 to 24Interview craftOne mock per week

A short note before you start

This roadmap is opinionated. There are other paths. The point is to pick one and finish it.

Do not skip Stage 1 because you “already know SQL”. Most engineers think they know SQL and then write queries that scan a billion rows when an index would have answered in milliseconds. Spend the four weeks. Read your own EXPLAIN plans. You will not regret it.

The build exercises are the whole point. A stage you only read is a stage you did not finish.