Health checks: liveness vs readiness vs startup
What each one really signals.
Health checks are how an orchestrator (Kubernetes, a load balancer, a service mesh) decides whether your process is doing its job. The three flavours sound similar but answer different questions, and conflating them causes a well-known class of self-inflicted outages: a liveness probe that restarts a perfectly healthy but slow pod every 30 seconds, forever. Knowing which check to run where, and what each one really means, prevents a category of incidents that take longer to find than to fix.
Three questions, three checks
flowchart TB
subgraph S["Startup probe — 'am I done initialising?'"]
direction LR
S1[("answered once at boot")]:::infra
S2[("'no' means: keep waiting, do not kill me yet")]:::strong
end
subgraph L["Liveness probe — 'am I still alive?'"]
direction LR
L1[("answered periodically forever")]:::infra
L2[("'no' means: I am wedged, restart me")]:::weak
end
subgraph R["Readiness probe — 'can I serve traffic right now?'"]
direction LR
R1[("answered periodically forever")]:::infra
R2[("'no' means: take me out of the load balancer for now")]:::strong
end
classDef infra fill:#fef3c7,stroke:#a16207,color:#713f12,stroke-width:1.5px
classDef strong fill:#dcfce7,stroke:#15803d,color:#14532d,stroke-width:1.5px
classDef weak fill:#fed7aa,stroke:#c2410c,color:#7c2d12,stroke-width:1.5px
- Startup: “I am still booting; do not test me with the other probes yet.”
- Liveness: “I am alive enough to be useful.” A failure here kills the process.
- Readiness: “I can serve traffic right now.” A failure here just removes the instance from the load balancer pool, no restart.
The actions are very different: a liveness failure leads to a restart; a readiness failure leads to removal from the pool. Putting a check in the wrong probe is what causes restart loops.
A typical pod lifecycle, with all three
sequenceDiagram
autonumber
participant K as Kubelet
participant P as Pod
P->>P: starts booting (cache warm, DB connections, ...)
Note over K,P: startup probe loop until success
loop until startup succeeds
K->>P: GET /startupz
P-->>K: 503 still warming
end
K->>P: GET /startupz
P-->>K: 200 ready to be probed
Note over K,P: now liveness and readiness probes begin
K->>P: GET /livez
P-->>K: 200 alive
K->>P: GET /readyz
P-->>K: 200 ready
Note over K: add pod to load balancer
Note over K,P: steady state
K->>P: GET /livez (every 10s)
K->>P: GET /readyz (every 5s)
Note over P: temporary downstream issue
K->>P: GET /readyz
P-->>K: 503 downstream failing
Note over K: remove from LB but do not restart
Note over P: downstream recovers
K->>P: GET /readyz
P-->>K: 200
Note over K: add back to LB
The pod was never killed. It was just shielded from traffic while the downstream was sick. That is exactly what readiness was designed for.
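The three endpoints in the diagram can be sketched with Python's standard library. This is a minimal illustration, not a production server: `AppState`, its flags, `probe_response`, `serve`, and port 8080 are all hypothetical names chosen for the example, and the application is assumed to update the flags itself.

```python
from dataclasses import dataclass
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Tuple


@dataclass
class AppState:
    """Hypothetical flags the application updates as it runs."""
    booted: bool = False          # set once caches, config, etc. are loaded
    wedged: bool = False          # set by an internal deadlock/watchdog detector
    downstream_ok: bool = True    # set by whatever tracks dependency health


def probe_response(path: str, state: AppState) -> Tuple[int, str]:
    """Map a probe path to an HTTP status code and a short body."""
    if path == "/startupz":
        return (200, "started") if state.booted else (503, "still warming")
    if path == "/livez":
        # Deliberately ignores downstream_ok: a sick dependency
        # is not a reason to restart this process.
        return (503, "wedged") if state.wedged else (200, "alive")
    if path == "/readyz":
        ready = state.booted and state.downstream_ok
        return (200, "ready") if ready else (503, "not ready")
    return (404, "unknown probe")


STATE = AppState()


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        status, body = probe_response(self.path, STATE)
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(body.encode())


def serve(port: int = 8080) -> None:
    """Blocks forever, answering probe requests on the given port."""
    HTTPServer(("0.0.0.0", port), ProbeHandler).serve_forever()
```

Keeping the decision in a pure function (`probe_response`) makes the semantics easy to test without opening a socket. Note that `/livez` and `/readyz` read different flags, which is the whole point of having separate probes.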
The famous failure: liveness that is too clever
A common mistake is a liveness probe that checks downstream dependencies: “I am alive if and only if I can reach the database.” When the database is briefly slow, every pod fails liveness and gets restarted, and the restart storm makes things worse.
sequenceDiagram
autonumber
participant K as Kubelet
participant P1 as Pod 1
participant DB as Database (slow)
K->>P1: liveness check
P1->>DB: ping db (timeout)
DB-->>P1: timeout
P1-->>K: 503 (livez "fails")
K->>P1: KILL — restart
Note over P1: 30 second restart cycle
Note over K,DB: same thing happens to every other pod
Note over DB: now all pods are restarting,<br/>making the DB problem worse
Liveness should answer only “is my process wedged?” Things like: an internal deadlock, a thread that has spun for 60 seconds, a watchdog that has not been kicked. Downstream health belongs in readiness, where the recovery is “wait”, not “kill.”
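A watchdog of the kind mentioned above might look like the following sketch. The `Watchdog` class, its method names, and the injectable `clock` parameter are assumptions made for illustration; the 60-second limit mirrors the example in the text.

```python
import time
from typing import Callable


class Watchdog:
    """Tracks when the main work loop last reported progress."""

    def __init__(self, max_silence_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_silence_s = max_silence_s
        self._clock = clock            # injectable so tests can control time
        self._last_kick = clock()

    def kick(self) -> None:
        """Called by the work loop each time it completes an iteration."""
        self._last_kick = self._clock()

    def is_alive(self) -> bool:
        """True while the loop has kicked within the allowed silence window."""
        return (self._clock() - self._last_kick) <= self._max_silence_s


# Usage: the work loop calls watchdog.kick(); the /livez handler
# returns 200 if watchdog.is_alive() and 503 otherwise.
watchdog = Watchdog(max_silence_s=60.0)
```

Nothing here touches the network or a dependency, so a slow database cannot make this check fail.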
What each probe should actually check
flowchart TB
subgraph SC["Startup — checks specific to boot"]
direction LR
SC1["caches warmed?"]:::ok
SC2["initial DB schema check?"]:::ok
SC3["essential config loaded?"]:::ok
end
subgraph LC["Liveness — checks the process is not wedged"]
direction LR
LC1["the HTTP server is responding"]:::ok
LC2["internal watchdog has been ticking"]:::ok
LC3["no thread deadlock detected"]:::ok
end
subgraph RC["Readiness — checks dependencies needed right now"]
direction LR
RC1["database reachable"]:::ok
RC2["required downstream healthy"]:::ok
RC3["circuit breaker not open"]:::ok
end
classDef ok fill:#dcfce7,stroke:#15803d,color:#14532d,stroke-width:1.5px
The rule of thumb: if a failure of the check should make the orchestrator restart you, put it in liveness; if it should only stop traffic reaching you for a while, put it in readiness.
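On the readiness side, the dependency checks can be combined into one answer. The sketch below assumes each dependency is represented by a zero-argument callable returning `True` when healthy; `evaluate_readiness` and the check names in the usage example are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def evaluate_readiness(
    checks: Dict[str, Callable[[], bool]]
) -> Tuple[int, List[str]]:
    """Run each dependency check; return (HTTP status, names of failing checks)."""
    failing: List[str] = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False   # a check that raises counts as a failure, not a crash
        if not ok:
            failing.append(name)
    return (200 if not failing else 503, failing)


# Usage with stand-in checks; real ones would ping the database,
# inspect the circuit breaker state, and so on.
status, failing = evaluate_readiness({
    "database": lambda: True,
    "breaker_closed": lambda: False,
})
```

Returning the names of the failing checks makes the 503 body useful when someone is debugging why a pod dropped out of the pool.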
Probe budgets and what to tune
- Period. How often the probe runs. Too frequent = wasted load on the app. Typical: 5-15 seconds.
- Failure threshold. How many failures in a row before the orchestrator acts. Typical: 3.
- Timeout. How long to wait for a response. Should be short relative to the period.
- Initial delay. How long to wait after the container starts before liveness/readiness probes begin. Startup probes are the better tool for this in modern Kubernetes; for systems without startup probes, set this generously.
Kubernetes defaults to a 10-second period and a failure threshold of 3, which means a wedged pod takes roughly 30 seconds to be restarted. Aggressive settings cause flapping; lax settings mean slow recovery from real problems.
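The "roughly 30 seconds" figure can be checked with a little arithmetic. This is a simplified model, not the kubelet's exact behaviour: it assumes a wedged process makes every probe run into its timeout, that the process wedges somewhere between two probes, and it ignores the time the restart itself takes. The function name is hypothetical; the 1-second timeout is the Kubernetes default for `timeoutSeconds`.

```python
from typing import Tuple


def detection_window_s(period_s: float, failure_threshold: int,
                       timeout_s: float) -> Tuple[float, float]:
    """Return (best, worst) seconds from 'process wedges' to 'threshold reached'."""
    # Best case: the process wedges just before a probe fires, so the
    # first failure is registered after one timeout.
    best = (failure_threshold - 1) * period_s + timeout_s
    # Worst case: it wedges just after a successful probe, so a full
    # period passes before the first failing probe even starts.
    worst = failure_threshold * period_s + timeout_s
    return best, worst


# Kubernetes defaults: 10 s period, 3 failures, 1 s timeout.
best, worst = detection_window_s(period_s=10, failure_threshold=3, timeout_s=1)
```

Under these assumptions the defaults give a window of 21 to 31 seconds, which is where the usual "about 30 seconds" comes from.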
Two scenarios
Scenario one: a backend service depending on Redis.
Liveness checks an internal deadlock detector. Readiness checks “can I reach Redis.” If Redis has a brief blip, readiness fails, the pod is taken out of the load balancer, traffic goes to other pods. Liveness still passes; the pod is not restarted. Redis recovers, readiness passes, the pod returns. No restart storm.
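Scenario one can be played through with a toy model of the orchestrator. The sketch below is a deliberate simplification: `PodView` is a hypothetical class, both probes are evaluated on the same tick, and a single success is assumed to be enough to rejoin the pool.

```python
from dataclasses import dataclass


@dataclass
class PodView:
    """What a simplified orchestrator tracks about one pod."""
    failure_threshold: int = 3
    in_pool: bool = True
    restarts: int = 0
    _live_failures: int = 0
    _ready_failures: int = 0

    def observe(self, live_ok: bool, ready_ok: bool) -> None:
        # Liveness: enough consecutive failures lead to a restart.
        self._live_failures = 0 if live_ok else self._live_failures + 1
        if self._live_failures >= self.failure_threshold:
            self.restarts += 1
            self._live_failures = 0
        # Readiness: enough consecutive failures only change pool membership.
        self._ready_failures = 0 if ready_ok else self._ready_failures + 1
        if self._ready_failures >= self.failure_threshold:
            self.in_pool = False
        elif ready_ok:
            self.in_pool = True


# A Redis blip lasting four probe ticks; liveness never looks at Redis.
pod = PodView()
timeline = []
for redis_ok in [True, False, False, False, False, True]:
    pod.observe(live_ok=True, ready_ok=redis_ok)
    timeline.append(pod.in_pool)
```

In this run the pod leaves the pool on the third consecutive readiness failure, returns when Redis recovers, and `pod.restarts` stays at zero. Passing `live_ok=redis_ok` instead reproduces the restart storm from the earlier section.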
Scenario two: a Java service that takes 90 seconds to JIT-warm; until then, requests hit latency spikes of around 30 seconds.
Startup probe with a 5-minute success deadline. Until the startup probe succeeds, neither liveness nor readiness runs. Once startup succeeds, the other two start. As long as the pod boots within that deadline, it never gets killed for “taking too long to boot.”
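In Kubernetes the startup deadline is not a single setting: it is the startup probe's failure threshold multiplied by its period. A small helper (hypothetical names) shows how the 5-minute budget from this scenario translates into a threshold, assuming a 10-second period:

```python
import math


def startup_budget_s(failure_threshold: int, period_s: float) -> float:
    """Total time the startup probe may keep failing before the pod is killed."""
    return failure_threshold * period_s


def startup_failure_threshold(boot_deadline_s: float, period_s: float) -> int:
    """Smallest failure threshold whose budget covers the boot deadline."""
    return math.ceil(boot_deadline_s / period_s)


# 5-minute deadline, probing every 10 seconds.
threshold = startup_failure_threshold(boot_deadline_s=300, period_s=10)
```

With these numbers the threshold comes out at 30, comfortably above the 9 failures a 90-second warm-up would produce at the same period.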
What this connects to
- Load balancer basics. Readiness probes are what the load balancer uses to decide pool membership. See Load balancer: why, how, when.
- Circuit breaker. A breaker that is open is a sign your readiness probe should report 503. See Circuit breaker.
- Observability. Probe failures and restarts are critical signals. See Observability: metrics, logs, traces.
- Leader election. A leader that has lost its lease should fail readiness. See Leader election.
- Graceful degradation. Sometimes you stay ready and serve a degraded response instead of failing readiness. See Graceful degradation.
Common mistakes
- Liveness checks downstream dependencies. Restart storms are guaranteed. Move that logic to readiness.
- No startup probe for slow-booting services. Liveness fires while the pod is still warming, killing it before it can answer.
- Same endpoint for all three. Liveness and readiness should be different endpoints with different semantics.
- Returning 200 from /healthz no matter what. Then the probe is meaningless. A liveness probe that always passes is worse than no probe.
- Probe takes too long. A /livez that itself takes 2 seconds runs every 10 seconds, eating 20% of an event loop. Probes should be cheap.
- Aggressive failure thresholds. One missed probe = restart means transient network blips cause needless restarts. Three failures in a row is typical.
- No probe metric. Restart loops happen quietly. Always emit a count of probe failures and restarts.
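Two of the mistakes above, slow probes and missing metrics, can be addressed together. One possible approach, sketched here under assumptions, is to cache the result of an expensive check for a short time and count failed evaluations as you go. `CachedCheck`, its `ttl_s` parameter, and the `failures` counter are hypothetical names; a real service would export the counter through its metrics library.

```python
import time
from typing import Callable, Optional


class CachedCheck:
    """Wraps an expensive check so probes mostly read a cached answer."""

    def __init__(self, check: Callable[[], bool], ttl_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._check = check
        self._ttl_s = ttl_s
        self._clock = clock            # injectable so tests can control time
        self._last_run: Optional[float] = None
        self._last_result = False
        self.failures = 0              # count of failed evaluations, for metrics

    def __call__(self) -> bool:
        now = self._clock()
        stale = self._last_run is None or now - self._last_run >= self._ttl_s
        if stale:
            # Only re-run the expensive check when the cached answer expired.
            self._last_result = bool(self._check())
            self._last_run = now
            if not self._last_result:
                self.failures += 1
        return self._last_result
```

The trade-off is staleness: with a 5-second TTL the probe can report an answer that is up to 5 seconds old, which is usually acceptable next to a failure threshold of three probes.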
Quick recap
- Startup probe: “I am still warming.” Runs only at boot.
- Liveness probe: “my process is wedged.” Failing it kills the process.
- Readiness probe: “I cannot serve traffic right now.” Failing it removes from LB pool.
- Never check downstream health in liveness; it belongs in readiness.
- Probes should be cheap, separate, and meaningful.
This concept sits in Stage 4 (Scaling and reliability) of the System Design Roadmap.