Data Engineering Track
Real production problems, not toy puzzles
65 hand-picked scenarios inspired by what actually breaks in data pipelines: late streams, schema drift, silent ETL failures, cost incidents, and debugging the wrong number. Filter by category, difficulty, or topic and start solving.
Start here
Data Engineering Roadmap — Beginner to Expert
A staged learning path from SQL fluency to running data systems at scale. Six months, in order. Each stage ends with something you build, not a quiz.
#1 Easy
Log File Error Analysis
Logs and Monitoring
file streaming, counters, top-N, IoT logs
#2 Easy
Rolling Average of Sensor Readings
Streaming
rolling window, deque, IoT sensors, real-time
#3 Medium
Transform and Clean Raw Data for Analytics
Data Cleaning
CSV, validation, regex, date checks
#4 Medium
Schema Evolution and Validation for Streaming Events
Schema Validation
JSON, schema evolution, type coercion, pydantic
#5 Medium
Merging Messy CSVs from Multiple Partners
Data Integration
CSV, column mapping, date parsing, file walk
#6 Easy
Partitioning vs Clustering in BigQuery
Fundamentals
BigQuery, partitioning, clustering, cost
#7 Easy
ETL vs ELT and Why ELT Won
Fundamentals
ETL, ELT, dbt, warehouse
#8 Easy
OLTP vs OLAP
Fundamentals
OLTP, OLAP, column store, row store
#9 Medium
Idempotency in Data Pipelines
Fundamentals
idempotency, retries, MERGE, partitions
#10 Medium
Slowly Changing Dimensions
Fundamentals
SCD, dimensions, history, dbt snapshot
#11 Medium
Data Contracts in Plain Words
Fundamentals
data contracts, schema registry, ownership
#12 Easy
Parquet vs CSV vs JSON
Fundamentals
Parquet, CSV, JSON, columnar storage
#13 Medium
Data Lake vs Warehouse vs Lakehouse
Fundamentals
lake, warehouse, lakehouse, Iceberg +1
#14 Medium
Exactly Once Delivery
Fundamentals
exactly once, idempotency, Kafka, streaming
#15 Medium
Teaching SQL Performance to a Junior
SQL Thinking
EXPLAIN, performance, mentoring, optimization
#16 Medium
SELECT DISTINCT Hiding Join Bugs
SQL Thinking
DISTINCT, joins, grain, semi-join
#17 Medium
Reading an EXPLAIN Plan
SQL Thinking
EXPLAIN, query plan, joins, sort spill
#18 Medium
CTE vs Subquery
SQL Thinking
CTE, subquery, materialization, recursion
#19 Medium
Same Query Different Answers
SQL Thinking
time zones, RLS, session settings, debugging
#20 Medium
Window Functions vs GROUP BY
SQL Thinking
window functions, GROUP BY, running totals, ranking
#21 Hard
Data Platform for an Electricity Retailer
System Design
smart meter, IoT, warehouse, batch
#22 Hard
Banking App Monthly Spending Widget
System Design
streaming, CDC, serving store, low latency
#23 Hard
Ride Hailing Surge Pricing
System Design
streaming, H3, real-time, pricing
#24 Hard
Spotify Minutes Listened This Week
System Design
streaming aggregation, KV store, watermarks
#25 Hard
Smart Meter to Monthly Bill PDF
System Design
billing, SCD2, idempotency, audit
#26 Hard
Delivery Idle Driver Tracking
System Design
streaming, H3, TTL, geospatial
#27 Medium
Year in Review Recap
System Design
batch, KV store, CDN, image render
#28 Medium
Low Balance Notification Pipeline
System Design
batch, idempotency, time zones, notifications
#29 Medium
Daily Report Quietly Wrong for Two Weeks
Scenarios
incident, postmortem, comms, data quality
#30 Medium
Warehouse Cost Doubled in Two Months
Scenarios
cost, governance, comms, INFORMATION_SCHEMA
#31 Easy
The Dashboard is Wrong
Scenarios
trust, comms, vague reports
#32 Medium
Inheriting a Pipeline No One Owns
Scenarios
ownership, judgement, rewrite-or-not
#33 Medium
Executive Needs a Number Tomorrow
Scenarios
comms, exec, caveats, prioritization
#34 Hard
Three Days of Data Lost
Scenarios
Kafka retention, replay, recovery, postmortem
#35 Medium
Lambda vs Cloud Function vs Cloud Run
Cloud Decisions
serverless, AWS, GCP, runtime limits
#36 Easy
Scheduled Pipeline Pay Only When Run
Cloud Decisions
scheduled jobs, Cloud Run Jobs, AWS Batch
#37 Medium
BigQuery vs Snowflake for New Team
Cloud Decisions
BigQuery, Snowflake, pricing model
#38 Easy
Store Partner Files in S3 or Warehouse
Cloud Decisions
S3, raw layer, audit, schema evolution
#39 Medium
Managed Airflow vs Self Hosted
Cloud Decisions
Airflow, MWAA, Composer, Astronomer +1
#40 Medium
BigQuery Access Control for 50 Person Company
Cloud Decisions
IAM, datasets, groups, RLS +1
#41 Medium
Tables for an Airbnb Like App
Data Modeling
star schema, SCD2, multi-currency, reviews
#42 Medium
Tracking Subscription Plan History
Data Modeling
history, valid_from/to, billing, SCD2
#43 Medium
Mixing Facts and Dimensions
Data Modeling
star schema, SCD2, views, history
#44 Easy
Explaining Fact Table Grain
Data Modeling
grain, facts, dimensions, aggregations
#45 Medium
Current State and Full History
Data Modeling
event sourcing, projections, MV, audit
#46 Medium
Region Suddenly Shows Zero Revenue
Debugging
dashboard, joins, SCD, time zones
#47 Medium
Airflow Green but Output Empty
Debugging
silent success, idempotency, anomaly checks
#48 Medium
Query Suddenly 80x Slower
Debugging
EXPLAIN, statistics, plan flip, join strategy
#49 Easy
User Says Data Is Wrong
Debugging
comms, vague reports, triage
#50 Medium
Partition Always Ten Percent Smaller
Debugging
anomaly, baselines, patterns, judgement
#51 Medium
BigQuery Bill Eight Times Higher
Cost & Performance
INFORMATION_SCHEMA, top queries, slot reservation
#52 Medium
Four Hour Spark Job Under One Hour
Cost & Performance
Spark UI, skew, AQE, broadcast joins
#53 Easy
Hourly Scan on Daily Data
Cost & Performance
summary tables, MV, refresh, BI tool
#54 Medium
Just Throw More Memory At It
Cost & Performance
upsize, plan inspection, optimization
#55 Easy
Partitioning Clustering Materialized Views
Cost & Performance
partitioning, clustering, MV, BigQuery
#56 Medium
Watermarks in Plain Words
Streaming
watermarks, event time, allowed lateness
#57 Medium
Kafka Ordering Guarantee
Streaming
Kafka, partition key, ordering, idempotent producer
#58 Medium
Streaming Consumer Lag Diagnosis
Streaming
lag, back-pressure, skew, Flink UI
#59 Easy
Onboarding a New Analyst
People & Process
onboarding, mentoring, pairing
#60 Easy
Metric by Tomorrow vs Doing It Right
People & Process
comms, prioritization, metrics
#61 Medium
Two Teams Disagree on Active User
People & Process
metric ownership, comms, metrics layer
#62 Medium
Postmortem After a Bad Day
People & Process
postmortem, blameless, action items
#63 Medium
Inherited Pipeline No Docs No Tests
People & Process
ownership, docs, tests, expectations
#64 Medium
Breaking Change in dbt Model 200 Consumers
People & Process
dbt, deprecation, comms, rollout
#65 Medium
4000 DAG Airflow at 90 Percent CPU
People & Process
Airflow, scheduler, parsing, scale-out
#66 Easy
Indexes When to Add and When They Hurt
Databases
indexes, B-tree, write cost, EXPLAIN
#67 Easy
Transactions and ACID
Databases
transactions, ACID, durability, atomicity
#68 Medium
Isolation Levels in Plain Words
Databases
isolation, snapshot, anomalies, MVCC
#69 Medium
Normalization and When to Denormalize
Databases
normalization, 3NF, denormalization, star schema
#70 Medium
B-Tree vs Hash vs LSM Tree
Databases
B-tree, hash, LSM, storage engines
#71 Medium
Read Replicas and Replication Lag
Databases
replicas, replication lag, read after write
#72 Hard
Sharding and Picking a Shard Key
Databases
sharding, shard key, hot shards, hash
#73 Medium
Database Connection Pooling
Databases
connection pool, PgBouncer, sizing, Postgres
#74 Medium
Deadlocks and Lock Escalation
Databases
deadlocks, locks, retries, lock escalation
#75 Medium
SQL vs NoSQL
Databases
SQL, NoSQL, KV, document +2