Reliability

Retry with exponential backoff and jitter

Why jitter matters more than backoff.

When a downstream call fails, the caller has a choice: give up, or try again. Trying again is usually correct because most failures are transient. But “try again” needs three rules: a cap on how many times, growing pauses between attempts (backoff), and random offsets on those pauses (jitter). Skip any one and you have created a new problem; usually a stampede that the downstream cannot survive.

What a naive retry does to a struggling downstream

Picture 1,000 clients all calling the same downstream at the same time. The downstream has a brief moment of slowness; all 1,000 see an error.

sequenceDiagram
    autonumber
    participant CL as 1,000 clients
    participant DOWN as Downstream

    CL->>DOWN: 1,000 simultaneous calls
    Note over DOWN: brief slowness, all 1,000 fail
    CL-->>DOWN: 1,000 immediate retries
    Note over DOWN: now hit by 1,000 retries on top of normal traffic
    CL-->>DOWN: 1,000 more retries (still failing)
    Note over DOWN: downstream that would have recovered<br/>is now in a death spiral

The downstream never gets a moment to breathe. A short blip becomes a sustained outage because every client is hammering at the same time. This is a retry storm, and it is one of the easiest ways to take down your own infrastructure.

Exponential backoff: wait longer each time

The first fix is to pause before each retry, and to make the pause grow with each attempt. A common formula:

delay = base × 2^attempt

With base = 100 ms: 100 ms, 200 ms, 400 ms, 800 ms, 1600 ms. The intuition: if the first retry did not work, the downstream is probably still in trouble; wait longer before adding more load.

sequenceDiagram
    autonumber
    participant CL as Client
    participant DOWN as Downstream

    CL->>DOWN: call
    DOWN-->>CL: error
    Note over CL: wait 100 ms
    CL->>DOWN: retry 1
    DOWN-->>CL: error
    Note over CL: wait 200 ms
    CL->>DOWN: retry 2
    DOWN-->>CL: error
    Note over CL: wait 400 ms
    CL->>DOWN: retry 3
    DOWN-->>CL: ok

A single client now stops hammering. Good. But put 1,000 of them together and they all retry at exactly the same moments. Same problem, slightly spaced.

Jitter: the part everyone forgets

Add a random offset to each delay. Two clients that fail at the same moment now retry at slightly different times. The downstream stops seeing synchronised waves.

flowchart TB
    subgraph NJ["Backoff without jitter — synchronised waves"]
        direction LR
        N1["t = 0 ms<br/>1,000 fail"]:::dead
        N2["t = 100 ms<br/>1,000 retry"]:::dead
        N3["t = 300 ms<br/>1,000 retry"]:::dead
        N4["t = 700 ms<br/>1,000 retry"]:::dead
    end

    subgraph J["Backoff with full jitter — spread"]
        direction LR
        J1["t = 0 ms<br/>1,000 fail"]:::weak
        J2["t = 0-100 ms<br/>retries spread evenly"]:::ok
        J3["t = 0-300 ms<br/>retries spread evenly"]:::ok
        J4["t = 0-700 ms<br/>retries spread evenly"]:::ok
    end

    classDef dead fill:#fecaca,stroke:#b91c1c,color:#7f1d1d,stroke-width:1.5px
    classDef weak fill:#fed7aa,stroke:#c2410c,color:#7c2d12,stroke-width:1.5px
    classDef ok fill:#dcfce7,stroke:#15803d,color:#14532d,stroke-width:1.5px

The most common jitter formula is full jitter: pick a delay uniformly between zero and the backoff cap.

delay = random(0, base × 2^attempt)

Average wait is similar, but the distribution is smooth instead of spiky. The downstream sees a gradual recovery curve instead of repeated waves.

Jitter matters more than backoff. Backoff helps one client; jitter is what saves the downstream from synchronised retries from many clients.

The full recipe

flowchart TB
    F["call fails"]:::weak
    Q1{"is this error retryable?<br/>(timeouts, 502, 503, network)"}:::query
    NO1["return error.<br/>do not retry validation or auth errors."]:::dead
    Q2{"have we exceeded the retry budget?"}:::query
    NO2["return error.<br/>retries are bounded."]:::dead
    W["wait = random(0, base × 2^attempt)<br/>capped at max"]:::infra
    R["retry the call"]:::strong

    F --> Q1
    Q1 -->|"no"| NO1
    Q1 -->|"yes"| Q2
    Q2 -->|"yes"| NO2
    Q2 -->|"no"| W
    W --> R
    R -->|"still fails"| F

    classDef weak fill:#fed7aa,stroke:#c2410c,color:#7c2d12,stroke-width:1.5px
    classDef query fill:#dbeafe,stroke:#1e40af,color:#1e3a8a,stroke-width:1.5px
    classDef dead fill:#fecaca,stroke:#b91c1c,color:#7f1d1d,stroke-width:1.5px
    classDef infra fill:#fef3c7,stroke:#a16207,color:#713f12,stroke-width:1.5px
    classDef strong fill:#dcfce7,stroke:#15803d,color:#14532d,stroke-width:1.5px

Every retry policy in production should answer these three questions:

Is this error retryable? Network errors, 502, 503, 504, and explicit “retry me” headers: yes. Validation errors, auth errors, business-logic errors: no.
Have we tried enough? A retry budget (e.g., 3 attempts, 30 seconds total) prevents runaway loops.
How long do we wait? Exponential backoff with full jitter, capped at a sensible maximum (often 30 seconds).

Retries and idempotency

Retries replay the request. If the original request created an order, retrying might create a second one. Without idempotency, retries quietly cause real harm: double-charges, double-emails, double-shipments. Idempotency keys are mandatory companions to retries. See Idempotency.

Retries vs circuit breakers

Both fight transient failures, in different ways:

Retries assume the error is brief; trying again will work.
Circuit breakers assume the error is sustained; stop trying so the downstream recovers.

They compose: retry transient errors a few times; if errors persist, the breaker opens and stops retries until the cooldown elapses.

Two scenarios

Scenario one: an API client calling a flaky third-party service.

The third-party returns 503s occasionally for a few seconds. Without retries, every blip becomes a user-visible error. With three retries on a 100 ms base + full jitter, almost all blips are absorbed and the user never sees them.

Scenario two: a background job hitting an internal service that just deployed.

Right after a deploy, the new pods have a 30-second warm-up. Workers that hit them get 502s. With retries + jitter, the workers spread their retries over a few seconds; by the time the budget expires, the new pods are warm and serving. Without jitter, the workers all retry at the same moments and overwhelm the slowest-warming pod.

What this connects to

Circuit breaker. The complementary pattern; retries handle blips, breakers handle outages. See Circuit breaker.
Idempotency. Required by anything that retries. See Idempotency.
Bulkheads and rate limiting. A retry storm overwhelms downstream; rate limiting and bulkheads contain the damage. See Bulkheads and rate limiting.
Delivery semantics. At-least-once delivery means messages get retried; consumers must be idempotent. See Delivery semantics.

Common mistakes

Retry without backoff. A retry storm in one line of code.
Backoff without jitter. Synchronised waves; multiple clients turn one outage into a sustained one.
No retry budget. Failed call retries forever. Threads and connections pile up.
Retrying non-idempotent operations. Two charges. Two emails. Two shipments.
Retrying every error. A 400 Bad Request will be 400 Bad Request next time. Validation errors are not transient.
Server-side retries on top of client-side retries. Each layer multiplies; one user request becomes 27 downstream calls. Pick one place to retry.
No metric on retry rate. If 50% of your calls retry, you have a downstream problem hiding behind apparent success. Measure.

Quick recap

Retry on transient errors (timeouts, 502, 503), not on validation or auth errors.
Exponential backoff prevents one client from hammering.
Jitter prevents many clients from hammering in unison.
Cap retries with a budget; the alternative is runaway loops.
Always idempotent. Always.

This concept sits in Stage 4 (Scaling and reliability) of the System Design Roadmap.

Last updated May 29, 2026