Data Engineering

Data Engineering Track

Real production problems, not toy puzzles

75 hand-picked scenarios inspired by what actually breaks in data pipelines: late streams, schema drift, silent ETL failures, cost incidents, and debugging the wrong number. Filter by category, difficulty, or topic and start solving.

75 Problems
15 Categories
3 Levels
Start here

Data Engineering Roadmap — Beginner to Expert

A staged learning path from SQL fluency to running data systems at scale. Six months, in order. Each stage ends with something you build, not a quiz.

7 stages, 60+ topics, no prerequisites
75 of 75 problems
#1 Easy

Log File Error Analysis

Logs and Monitoring
file streaming, counters, top-N, IoT logs
#2 Easy

Rolling Average of Sensor Readings

Streaming
rolling window, deque, IoT sensors, real-time
#3 Medium

Transform and Clean Raw Data for Analytics

Data Cleaning
CSV, validation, regex, date checks
#4 Medium

Schema Evolution and Validation for Streaming Events

Schema Validation
JSON, schema evolution, type coercion, pydantic
#5 Medium

Merging Messy CSVs from Multiple Partners

Data Integration
CSV, column mapping, date parsing, file walk
#6 Easy

Partitioning vs Clustering in BigQuery

Fundamentals
BigQuery, partitioning, clustering, cost
#7 Easy

ETL vs ELT and Why ELT Won

Fundamentals
ETL, ELT, dbt, warehouse
#8 Easy

OLTP vs OLAP

Fundamentals
OLTP, OLAP, column store, row store
#9 Medium

Idempotency in Data Pipelines

Fundamentals
idempotency, retries, MERGE, partitions
#10 Medium

Slowly Changing Dimensions

Fundamentals
SCD, dimensions, history, dbt snapshot
#11 Medium

Data Contracts in Plain Words

Fundamentals
data contracts, schema registry, ownership
#12 Easy

Parquet vs CSV vs JSON

Fundamentals
Parquet, CSV, JSON, columnar storage
#13 Medium

Data Lake vs Warehouse vs Lakehouse

Fundamentals
lake, warehouse, lakehouse, Iceberg, +1
#14 Medium

Exactly Once Delivery

Fundamentals
exactly once, idempotency, Kafka, streaming
#15 Medium

Teaching SQL Performance to a Junior

SQL Thinking
EXPLAIN, performance, mentoring, optimization
#16 Medium

SELECT DISTINCT Hiding Join Bugs

SQL Thinking
DISTINCT, joins, grain, semi-join
#17 Medium

Reading an EXPLAIN Plan

SQL Thinking
EXPLAIN, query plan, joins, sort spill
#18 Medium

CTE vs Subquery

SQL Thinking
CTE, subquery, materialization, recursion
#19 Medium

Same Query Different Answers

SQL Thinking
time zones, RLS, session settings, debugging
#20 Medium

Window Functions vs GROUP BY

SQL Thinking
window functions, GROUP BY, running totals, ranking
#21 Hard

Data Platform for an Electricity Retailer

System Design
smart meter, IoT, warehouse, batch
#22 Hard

Banking App Monthly Spending Widget

System Design
streaming, CDC, serving store, low latency
#23 Hard

Ride Hailing Surge Pricing

System Design
streaming, H3, real-time, pricing
#24 Hard

Spotify Minutes Listened This Week

System Design
streaming aggregation, KV store, watermarks
#25 Hard

Smart Meter to Monthly Bill PDF

System Design
billing, SCD2, idempotency, audit
#26 Hard

Delivery Idle Driver Tracking

System Design
streaming, H3, TTL, geospatial
#27 Medium

Year in Review Recap

System Design
batch, KV store, CDN, image render
#28 Medium

Low Balance Notification Pipeline

System Design
batch, idempotency, time zones, notifications
#29 Medium

Daily Report Quietly Wrong for Two Weeks

Scenarios
incident, postmortem, comms, data quality
#30 Medium

Warehouse Cost Doubled in Two Months

Scenarios
cost, governance, comms, INFORMATION_SCHEMA
#31 Easy

The Dashboard is Wrong

Scenarios
trust, comms, vague reports
#32 Medium

Inheriting a Pipeline No One Owns

Scenarios
ownership, judgement, rewrite-or-not
#33 Medium

Executive Needs a Number Tomorrow

Scenarios
comms, exec, caveats, prioritization
#34 Hard

Three Days of Data Lost

Scenarios
Kafka retention, replay, recovery, postmortem
#35 Medium

Lambda vs Cloud Function vs Cloud Run

Cloud Decisions
serverless, AWS, GCP, runtime limits
#36 Easy

Scheduled Pipeline Pay Only When Run

Cloud Decisions
scheduled jobs, Cloud Run Jobs, AWS Batch
#37 Medium

BigQuery vs Snowflake for New Team

Cloud Decisions
BigQuery, Snowflake, pricing model
#38 Easy

Store Partner Files in S3 or Warehouse

Cloud Decisions
S3, raw layer, audit, schema evolution
#39 Medium

Managed Airflow vs Self Hosted

Cloud Decisions
Airflow, MWAA, Composer, Astronomer, +1
#40 Medium

BigQuery Access Control for 50 Person Company

Cloud Decisions
IAM, datasets, groups, RLS, +1
#41 Medium

Tables for an Airbnb Like App

Data Modeling
star schema, SCD2, multi-currency, reviews
#42 Medium

Tracking Subscription Plan History

Data Modeling
history, valid_from/to, billing, SCD2
#43 Medium

Mixing Facts and Dimensions

Data Modeling
star schema, SCD2, views, history
#44 Easy

Explaining Fact Table Grain

Data Modeling
grain, facts, dimensions, aggregations
#45 Medium

Current State and Full History

Data Modeling
event sourcing, projections, MV, audit
#46 Medium

Region Suddenly Shows Zero Revenue

Debugging
dashboard, joins, SCD, time zones
#47 Medium

Airflow Green but Output Empty

Debugging
silent success, idempotency, anomaly checks
#48 Medium

Query Suddenly 80x Slower

Debugging
EXPLAIN, statistics, plan flip, join strategy
#49 Easy

User Says Data Is Wrong

Debugging
comms, vague reports, triage
#50 Medium

Partition Always Ten Percent Smaller

Debugging
anomaly, baselines, patterns, judgement
#51 Medium

BigQuery Bill Eight Times Higher

Cost & Performance
INFORMATION_SCHEMA, top queries, slot reservation
#52 Medium

Four Hour Spark Job Under One Hour

Cost & Performance
Spark UI, skew, AQE, broadcast joins
#53 Easy

Hourly Scan on Daily Data

Cost & Performance
summary tables, MV, refresh, BI tool
#54 Medium

Just Throw More Memory At It

Cost & Performance
upsize, plan inspection, optimization
#55 Easy

Partitioning Clustering Materialized Views

Cost & Performance
partitioning, clustering, MV, BigQuery
#56 Medium

Watermarks in Plain Words

Streaming
watermarks, event time, allowed lateness
#57 Medium

Kafka Ordering Guarantee

Streaming
Kafka, partition key, ordering, idempotent producer
#58 Medium

Streaming Consumer Lag Diagnosis

Streaming
lag, back-pressure, skew, Flink UI
#59 Easy

Onboarding a New Analyst

People & Process
onboarding, mentoring, pairing
#60 Easy

Metric by Tomorrow vs Doing It Right

People & Process
comms, prioritization, metrics
#61 Medium

Two Teams Disagree on Active User

People & Process
metric ownership, comms, metrics layer
#62 Medium

Postmortem After a Bad Day

People & Process
postmortem, blameless, action items
#63 Medium

Inherited Pipeline No Docs No Tests

People & Process
ownership, docs, tests, expectations
#64 Medium

Breaking Change in dbt Model 200 Consumers

People & Process
dbt, deprecation, comms, rollout
#65 Medium

4000 DAG Airflow at 90 Percent CPU

People & Process
Airflow, scheduler, parsing, scale-out
#66 Easy

Indexes When to Add and When They Hurt

Databases
indexes, B-tree, write cost, EXPLAIN
#67 Easy

Transactions and ACID

Databases
transactions, ACID, durability, atomicity
#68 Medium

Isolation Levels in Plain Words

Databases
isolation, snapshot, anomalies, MVCC
#69 Medium

Normalization and When to Denormalize

Databases
normalization, 3NF, denormalization, star schema
#70 Medium

B-Tree vs Hash vs LSM Tree

Databases
B-tree, hash, LSM, storage engines
#71 Medium

Read Replicas and Replication Lag

Databases
replicas, replication lag, read after write
#72 Hard

Sharding and Picking a Shard Key

Databases
sharding, shard key, hot shards, hash
#73 Medium

Database Connection Pooling

Databases
connection pool, PgBouncer, sizing, Postgres
#74 Medium

Deadlocks and Lock Escalation

Databases
deadlocks, locks, retries, lock escalation
#75 Medium

SQL vs NoSQL

Databases
SQL, NoSQL, KV, document, +2