dbt Incremental vs Full Refresh
Scenario: A dbt project has 40 models. Most are materialised as table, which means every dbt run rebuilds the whole warehouse from scratch. The largest fact table now has 800 million rows and the daily run takes 90 minutes. The bill is climbing. A teammate suggests changing every model to incremental. You say “not every model, and not the way you think.” You are asked to teach the difference.
In the interview, the question is:
When should a dbt model be incremental, when should it be a full refresh, and what does the
incrementalmaterialisation actually do under the hood?
Your Task:
- Explain the three dbt materialisations (view, table, incremental) in one sentence each.
- Walk through what
is_incremental()does and how a typical incremental model looks. - Cover the three incremental strategies (append, merge, delete+insert) and when each fits.
- Explain when an incremental model is the wrong choice.
What a Good Answer Covers:
MERGEunder the hood, generated by dbt.- The unique key and what happens without one.
- Late-arriving data and the lookback window.
- When a full refresh is still the right answer (small dim tables, model logic changed).
--full-refreshflag and when to run it.
Try the problem on your own first. Solutions are most valuable after you've struggled with it.
Solution 77: dbt Incremental vs Full Refresh
Short version you can say out loud
dbt’s three core materialisations are view (a SQL view, no storage, slow at read time), table (rebuilt fully on every run, fast at read time, expensive at run time), and incremental (rebuilt as “new rows only” on every run, fast at run time, mostly fast at read time). Incremental is right when the model is large, append-mostly, and has a stable filter for “what is new.” It is wrong for small dimensions, for models whose business logic just changed, or for models where rows update in place across the whole history. Under the hood, an incremental model is just a
MERGE(orINSERTplusDELETE) generated by dbt from youris_incremental()block, and the unique key controls whether re-running is safe.
The three materialisations
| Materialisation | What dbt creates | Cost at run | Cost at read | When to use |
|---|---|---|---|---|
view | A SQL view | Free | Slow on big models | Small, rarely queried, always-fresh logic |
table | A new table, rebuilt every run | Pays for the whole rebuild | Fast | Small to medium tables, dims, or where simplicity matters |
incremental | A table, rebuilt only for new rows | Pays only for the new slice | Fast | Large, append-mostly facts |
The teammate’s “change everything to incremental” mistake is the same one juniors make with indexes: applying a sharp tool everywhere instead of where it pays off.
What an incremental model actually looks like
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
{{ config(
materialized='incremental',
unique_key='order_id',
incremental_strategy='merge'
) }}
SELECT
order_id,
customer_id,
amount,
updated_at
FROM {{ ref('stg_orders') }}
{% if is_incremental() %}
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}
The {% if is_incremental() %} block is the trick. dbt evaluates it differently in two modes:
- On the first run, or when you pass
--full-refresh, the block is skipped. dbt drops the table and inserts everything. - On every subsequent run, the block runs. dbt only reads rows newer than the high watermark, and writes them in via
MERGEon the unique key.
That MERGE is dbt’s generated SQL. It looks like:
1
2
3
4
5
MERGE INTO analytics.orders AS target
USING (SELECT ... FROM stg_orders WHERE updated_at > ...) AS source
ON target.order_id = source.order_id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT (...);
The three incremental strategies
flowchart LR
A([New rows from source]):::src --> S{Strategy}:::tx
S -->|append| B[("Append only<br/>no key check")]:::out
S -->|merge| C[("Merge on unique_key<br/>update if exists, insert if not")]:::out
S -->|delete+insert| D[("Delete the partition<br/>then INSERT the new one")]:::out
classDef src fill:#dbeafe,stroke:#1e40af,color:#1e3a8a
classDef tx fill:#fef3c7,stroke:#a16207,color:#713f12
classDef out fill:#dcfce7,stroke:#15803d,color:#14532d
1. append
Simplest. New rows go in, no deduping. Use when the source is guaranteed never to emit the same row twice (immutable logs, click events with a unique event id you trust).
Risk: a retry double-inserts. You need an idempotent source or you wear the duplicates.
2. merge
The default on most warehouses. Uses the unique_key to upsert. Safe to retry. Handles updates to existing rows correctly.
Cost: MERGE does a join under the hood. For very large tables this can be expensive. Use a partition filter (BigQuery: partition_by plus an on_schema_change config) so the merge only touches recent partitions.
3. delete+insert
For partitioned tables. Identify which partitions the new data touches, delete those partitions, then insert. Equivalent to the “delete and reinsert by partition” pattern from problem 9. Safe to retry by construction.
Best for time-partitioned facts where you know “this run only touches the last 3 days.”
The unique key is not optional in real life
If unique_key is missing on a merge model, dbt appends every row. Rerunning a backfill duplicates everything. Almost every dbt outage where “the numbers doubled” is a missing or wrong unique key.
The unique key must be:
- Stable. Surrogate keys,
event_id,order_id. Not “first name plus city.” - Actually unique in the source. If the source has duplicates,
MERGEwill pick one and you will not see the conflict.
For wide composite keys, use a surrogate_key macro to hash them, or list the columns explicitly.
When incremental is the wrong choice
- Small tables. A 10,000-row dim table builds in 2 seconds as
table. Making it incremental adds complexity for no win. - Logic changed. The SQL of an incremental model changed yesterday. Now half the table is on the old logic and half on the new. Run
dbt run --full-refresh -s my_modelto force a rebuild. - Late-arriving data outside your lookback window. If a row updates 30 days after creation and your incremental filter only looks at the last 7 days, you miss it. Two fixes: widen the window, or accept reprocessing the late rows separately.
- Wide updates. If most rows in the history can update on any given day, incremental does not help: every run re-touches the whole table. Stay on
table. - You are still iterating on the model. During development,
tableis honest and fast to reason about. Switch toincrementalonly when the model is stable.
A realistic incremental for the 800M-row fact
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
{{ config(
materialized='incremental',
unique_key='order_id',
incremental_strategy='merge',
partition_by={'field': 'order_date', 'data_type': 'date'},
on_schema_change='append_new_columns'
) }}
SELECT
order_id,
customer_id,
order_date,
amount,
updated_at
FROM {{ ref('stg_orders') }}
{% if is_incremental() %}
WHERE updated_at > (SELECT MAX(updated_at) - INTERVAL '2 days' FROM {{ this }})
AND order_date >= (SELECT MAX(order_date) - INTERVAL '7 days' FROM {{ this }})
{% endif %}
Three things this does:
- Picks up updates from the last 2 days (lookback for late rows).
- Restricts the
MERGEto partitions in the last 7 days (cost control). on_schema_change='append_new_columns'keeps the table valid when a new column lands.
When to run --full-refresh
- The model SQL changed in a way that affects old rows (a calculation fix, a renamed column).
- The unique key changed (rare, but it happens).
- You suspect drift between source and warehouse and want a clean rebuild.
Schedule full refresh quarterly for very long-lived incremental models as a sanity check, or write a dbt test that compares aggregate row counts against the source.
Common mistakes interviewers want you to name
- Making every model incremental. Most dbt projects have 80% dim/small models that should stay as
table. - Missing or wrong unique key. The doubled-numbers outage.
- Lookback window too small. Late rows quietly missed.
- Schema change during incremental. Without
on_schema_change, the model breaks or silently drops the new column. - Forgetting
--full-refreshafter a logic change. Half-old, half-new data lives in the warehouse forever.
Bonus follow-up the interviewer might throw
“How would you test an incremental model?”
Three layers:
- Source tests.
uniqueandnot_nullon the unique key. If the source is broken,MERGEsilently picks a winner. - Model tests. Same
uniqueandnot_nullon the model itself. Catches dbt’sMERGEgoing wrong. - A row-count or aggregate sanity test. Compare the model against the source on a sample date. dbt’s
equalitytest or a custom one. This is the only thing that catches a stale lookback window.
Run them on every PR and after every dbt run in production.