DE Concepts
Plain-English answers to the questions that show up on every data team.
A growing reference library. The hard parts of data engineering distilled into short, scenario-driven explanations: SQL, modeling, file formats, batch, streaming, orchestration, quality, observability, cost, and the cloud trade-offs. Use it alongside the roadmap, or as a quick lookup before an interview.
SQL Foundations
7 topicsReading an EXPLAIN plan
The fastest way to know why your query is slow.
Window functions
Compute over a group without collapsing the rows.
CTEs vs subqueries vs temp tables
Three ways to split a query. Picking the right one.
The seven JOIN shapes (and which one you actually wanted)
Inner, left, right, full, cross, semi, anti — each has a job.
NULL semantics: three-valued logic
NULL is not 'nothing'. It is 'unknown'. The bugs that follow.
Set operations: UNION, INTERSECT, EXCEPT
Stacking results vertically when JOIN is the wrong tool.
Recursive CTEs and lateral joins
Two SQL features that solve problems nothing else can solve cleanly.
Data Modeling
8 topicsStar schema vs snowflake schema
One fact table in the middle. Dimensions flat or normalised.
Fact tables and dimension tables
Things that happen, vs things that describe them.
Grain: the unspoken hard part of modeling
What one row in this table means. Get this wrong and nothing else works.
Slowly Changing Dimensions (Type 1, 2, 3, 4, 6)
When a customer moves city, what do old orders say?
Surrogate keys vs natural keys
When the warehouse owns the key, vs when the source does.
Conformed dimensions
One 'customer' table, used by every fact table that needs one.
One Big Table vs normalised
The pattern modern warehouses keep nudging you toward.
Data Vault modeling
Hubs, links, and satellites. When dimensional modeling is not enough.
File Formats & Storage
8 topicsRow-oriented vs column-oriented storage
Two ways to lay bytes on disk. One for transactions, one for analytics.
Parquet, deeply
The columnar file format every analytics tool reads. What's inside it.
ORC vs Parquet vs Avro
Three binary formats. Each one wins in a specific job.
Delta Lake vs Apache Iceberg vs Apache Hudi
Three table formats over Parquet files. They give you transactions, time travel, and schema evolution.
Partitioning vs bucketing vs clustering
Three ways to organise files so queries scan less.
Compression: Snappy vs Gzip vs Zstd vs LZ4
Trade CPU for I/O. Pick the codec that matches your bottleneck.
Schema evolution in columnar formats
Add a column. Rename a column. Drop a column. Without rewriting the world.
The small-files problem
Ten thousand tiny Parquet files will outscore your query even if the data is tiny.
Batch Processing
8 topicsETL vs ELT (and why ELT won)
Extract, transform, load. Or extract, load, transform. The order matters.
Idempotent batch jobs
Running the same job twice produces the same answer. The whole platform changes when this is true.
Backfill strategies
Reprocessing 90 days of history without melting the warehouse or the budget.
Full refresh vs incremental vs CDC-driven loads
Three ways to keep a table up to date. Each one fits a specific shape of source.
Shuffle: why it dominates your job runtime
When data crosses the network between executors, you are paying for it.
Skew handling in distributed batch
One key has 90% of the rows. The job waits on one executor while the rest idle.
Broadcast joins vs shuffle joins
Send a small table to every executor. Avoid the shuffle entirely.
UDFs: the hidden costs in Spark and SQL warehouses
User-defined functions look free. They are not.
Stream Processing
7 topicsEvent time vs processing time
The two clocks every streaming system juggles. Get them wrong and everything else breaks.
Watermarks: the unintuitive part of streaming
How a streaming system decides it has 'seen enough' of a time window to compute a result.
Windowing: tumbling, sliding, session
Three ways to chop an unbounded stream into bounded chunks.
Exactly-once in streaming: what it actually means
The phrase is a marketing word until you read the fine print.
Stateful vs stateless streams
Some streaming jobs remember nothing. Others remember everything. The cost is different.
Reprocessing and replay
Re-running a stream against history, without breaking the live one.
Flink vs Spark Structured Streaming vs Kafka Streams
Three real choices for stream processing. Each one wins in a specific shape of team.
Orchestration
6 topicsDAGs and scheduling 101
Directed Acyclic Graph. Tasks with dependencies, run in order, no cycles allowed.
Idempotent tasks in orchestration
The same property as idempotent batch jobs, applied per orchestrator task.
Backfills inside an orchestrator
Triggering 90 historical runs without breaking the live schedule.
Dependency types: data, time, external
Three reasons one task waits for another.
Airflow vs Dagster vs Prefect
Three orchestrators. Each one represents a different design philosophy.
Sensors vs triggers vs event-driven
Three ways to start a job when something arrives, with very different operational costs.
Data Quality & Contracts
6 topicsSchema tests: not null, unique, FK, accepted values
The four tests that catch 80% of data bugs before a dashboard sees them.
Freshness tests and SLAs on data
Is the data new enough? The check that catches silent pipeline failures.
Volume tests and anomaly detection
Yesterday: 5 million rows. Today: 1.2 million. Something broke. Catch it before users notice.
dbt tests: singular, generic, custom
Three ways to write a data test in dbt. Each has a job.
Great Expectations vs Soda vs dbt
Three data quality tools. They overlap, but they were built for different jobs.
Data contracts between teams
An agreement, in code, about what a producer promises and what consumers can assume.
Observability & SLOs
5 topicsWhat data observability means
The five signals that tell you the data is healthy. The vocabulary the field landed on.
Data lineage and how teams actually use it
When a dashboard is wrong, can you trace it back to the source in under 5 minutes?
Cost-per-query attribution
Knowing which team, which dashboard, which model is burning the budget.
Slow query attribution
When the warehouse is slow, finding the one query that did it.
SLI / SLO / error budgets for data
The SRE vocabulary, applied to pipelines. The 'how fresh, how correct, how much is acceptable' contract.
Cost & FinOps
5 topicsPartitioning for cost (not just performance)
The right partition column cuts the bill by 10x. The wrong one doubles it.
Storage tiers: hot, warm, cold
Pay more for fast access, less for slow access. Move data between tiers.
Compute autoscaling for warehouses
Spin up when queries arrive, spin down when they don't. Pay for what you use.
Reserved vs on-demand pricing for warehouses
Commit ahead of time for a discount. The math is real, and so is the lock-in.
The 'scan less' rule
Every byte you don't scan is a byte you don't pay for. The whole cost game in one sentence.
Security & Privacy
5 topicsRow-level security in warehouses
Sales people see their own region. Managers see their teams. Same table, different rows per user.
Column masking and dynamic data masking
Email addresses show as a***@example.com unless you're allowed to see the real thing.
PII tokenisation and pseudonymisation
Replace the email with a stable random token. Keep the analytics. Lose the leak.
GDPR right-to-delete in a columnar warehouse
A user asks to be forgotten. You have their data in 47 tables. Now what?
Data residency in multi-region warehouses
EU data stays in the EU. US data stays in the US. The warehouse must enforce it.
Architecture Patterns
5 topicsMedallion architecture: bronze, silver, gold
Three layers, each one cleaner and more refined than the last.
Data lakehouse: the pattern, not the brand
Lake-style storage, warehouse-style transactions. The two worlds, finally fused.
Data mesh: when it works and when it doesn't
Domain teams own their data as products. Central platform team owns the rails.
dbt project structure that scales
How to lay out 500 models so people can still find anything.
CDC-driven architectures
When the warehouse mirrors the production database, change for change, in seconds.
Cloud Comparisons
7 topicsWarehouses: Snowflake vs BigQuery vs Redshift vs Databricks SQL
Four real choices for an analytical warehouse. Each one fits a different shape of team.
Lakehouse engines: Databricks vs Snowflake vs BigQuery
Three vendors all calling their thing a lakehouse. They mean different things.
Managed ETL: Fivetran vs Airbyte vs Stitch
Don't build your own Salesforce connector. Pay someone, or use the open-source one.
Reverse ETL: Hightouch vs Census
Take the modelled, cleaned warehouse data and push it back into Salesforce, Braze, Intercom, the ad networks.
Managed orchestration: Astronomer vs Dagster Cloud vs Prefect Cloud
You can run Airflow yourself. Most teams shouldn't.
dbt Cloud vs self-hosted dbt Core
Same dbt-core under the hood. Different runtime, different cost, different team experience.
Streaming platforms: Confluent Cloud vs MSK vs Redpanda
Kafka, with three different operational shapes. Pick the one that matches the team you actually have.
No topics match these filters
Try a different search term or clear the filters.