Data Engineering Concepts

Plain-English answers to the questions that show up on every data team.

A growing reference library: the hard parts of data engineering distilled into short, scenario-driven explanations covering SQL, modeling, file formats, batch, streaming, orchestration, quality, observability, cost, and cloud trade-offs. Use it alongside the roadmap, or as a quick lookup before an interview.

Reading an EXPLAIN plan

Window functions

CTEs vs subqueries vs temp tables

The seven JOIN shapes (and which one you actually wanted)

NULL semantics: three-valued logic

Set operations: UNION, INTERSECT, EXCEPT

Recursive CTEs and LATERAL joins

Star schema vs snowflake schema

Fact tables and dimension tables

Grain: the unspoken hard part of modeling

Slowly Changing Dimensions (Type 1, 2, 3, 6)

Surrogate keys vs natural keys

Conformed dimensions

One Big Table vs normalised

Data Vault modeling

Row-oriented vs column-oriented storage

Parquet, deeply

ORC vs Parquet vs Avro

Delta Lake vs Apache Iceberg vs Apache Hudi

Partitioning vs bucketing vs clustering

Compression: Snappy vs Gzip vs Zstd vs LZ4

Schema evolution in columnar formats

The small-files problem

ETL vs ELT (and why ELT won)

Idempotent batch jobs

Backfill strategies

Full refresh vs incremental vs CDC-driven loads

Shuffle: why it dominates your job runtime

Skew handling in distributed batch

Broadcast joins vs shuffle joins

UDFs: the hidden costs in Spark and SQL warehouses

Event time vs processing time

Watermarks: the unintuitive part of streaming

Windowing: tumbling, sliding, session

Exactly-once in streaming: what it actually means

Stateful vs stateless streams

Reprocessing and replay

Flink vs Spark Structured Streaming vs Kafka Streams

DAGs and scheduling 101

Idempotent tasks in orchestration

Backfills inside an orchestrator

Dependency types: data, time, external

Airflow vs Dagster vs Prefect

Sensors vs triggers vs event-driven

Schema tests: not null, unique, FK, accepted values

Freshness tests and SLAs on data

Volume tests and anomaly detection

dbt tests: singular, generic, custom

Great Expectations vs Soda vs dbt

Data contracts between teams

What data observability means

Data lineage and how teams actually use it

Cost-per-query attribution

Slow query attribution

SLI / SLO / error budgets for data

Partitioning for cost (not just performance)

Storage tiers: hot, warm, cold

Compute autoscaling for warehouses

Reserved vs on-demand pricing for warehouses

The 'scan less' rule

Row-level security in warehouses

Column masking and dynamic data masking

PII tokenisation and pseudonymisation

GDPR right-to-delete in a columnar warehouse

Data residency in multi-region warehouses

Medallion architecture: bronze, silver, gold

Data lakehouse: the pattern, not the brand

Data mesh: when it works and when it doesn't

dbt project structure that scales

CDC-driven architectures

Warehouses: Snowflake vs BigQuery vs Redshift vs Databricks SQL

Lakehouse engines: Databricks vs Snowflake vs BigQuery

Managed ETL: Fivetran vs Airbyte vs Stitch

Reverse ETL: Hightouch vs Census

Managed orchestration: Astronomer vs Dagster Cloud vs Prefect Cloud

dbt Cloud vs self-hosted dbt Core

Streaming platforms: Confluent Cloud vs MSK vs Redpanda
