Problem #16 Easy Networking

Design a Load Balancer

l4l7health checkssticky sessionsalgorithmsfailover

What we are building

A load balancer sits between internet clients and a fleet of backend servers. It picks one backend for each incoming connection, spreads load across the pool, and removes unhealthy backends from rotation within seconds. Every internet-facing product you have used runs behind one.

Concretely: 100 backend servers each handle ~300 req/s. A load balancer in front receives ~30,000 req/s and distributes them across the pool. One backend dies. The LB detects the failure within 15 seconds and stops sending it traffic. TLS terminates at the LB so each backend speaks plain HTTP. The LB also survives the death of one of its own instances.

The problem looks like “just a proxy.” The real work is five hard problems hiding underneath:

L4 vs L7. An L4 LB sees only IP and port. An L7 LB reads the full HTTP request. The choice drives TLS termination, path routing, and how you count connections.
Health check accuracy. A backend returning 200 to /healthz can still be throwing 500s on real traffic. Detecting the difference without flooding the backend with probes is not obvious.
Sticky sessions vs stateless. Cookie-based stickiness is easy to enable and hard to live with. It causes load skew. Stateless backends cost a network hop per request for session data.
Attack traffic. Slow-loris attacks hold connections open without sending data. SYN floods exhaust the connection table. Both look different from L4 and L7 and need different mitigations.
TLS termination cost. At 3,000 new connections per second, TLS handshakes consume roughly six CPU cores. Session resumption, ECDSA certs, and TLS 1.3 each cut that cost by a measurable amount.

We will start with the simplest setup that works. Then we add one problem at a time and watch the design grow.

The lifecycle of one connection

Every request passes through the same small loop. Picture it before drawing any boxes.

stateDiagram-v2
    direction LR
    [*] --> Received: TCP connection accepted
    Received --> PickBackend: health list checked
    PickBackend --> Forwarded: selected backend gets the request
    Forwarded --> Done: response returned to client
    Done --> [*]
    PickBackend --> Rejected: no healthy backends
    Rejected --> [*]

The picking step seems trivial. Everything interesting happens around it: maintaining the health list, handling the case where the selected backend dies mid-request, and preventing a single bad backend from absorbing all retry traffic.

Take this with you. A load balancer has two jobs: pick a backend, and keep the backend list honest. The picking is the easy part.

How big this gets

A realistic internet-facing service at each stage of growth.

Stage	Daily users	Peak req/s	Bandwidth	New TLS/s	Concurrent connections
1	1k	20	~10 Mbps	~2	~100
2	100k	500	~240 Mbps	~50	~2,500
3	1M	5,000	~2.4 Gbps	~500	~25,000
4	10M	30,000	~5-15 Gbps	~3,000	~200,000+

Show: where the numbers come from

Assume average request is 4 KB and average response is 60 KB (~64 KB per request-response pair). Average held connection time: ~5 seconds.

Stage 1 (1k users, 20 req/s): 20 × 64 KB = ~1.3 MB/s = ~10 Mbps. ~100 concurrent connections. TLS: ~2 new handshakes per second, one CPU core is bored. One backend handles everything.

Stage 2 (100k users, 500 req/s): 500 × 64 KB = ~32 MB/s = ~240 Mbps. ~2,500 concurrent connections. TLS: ~50 new/s, fraction of a core. 3-5 backends needed. Health checks start to matter.

Stage 3 (1M users, 5k req/s): ~320 MB/s = ~2.4 Gbps. ~25k connections. TLS: ~500 new/s, roughly one core doing nothing but handshakes. 20-50 backends. Session resumption becomes worthwhile.

Stage 4 (10M users, 30k req/s): ~5+ Gbps. 200k+ concurrent connections (millions with WebSockets). TLS: ~3,000 new/s, six cores for handshakes alone. Hundreds of backends. The LB itself needs to be a cluster, not one box.

Three ceilings that each break first depending on workload:

Bandwidth: a 10 Gbps NIC caps at ~9 Gbps usable. Video or file downloads hit this before TLS becomes a problem.
TLS CPU: each handshake costs 1-3 ms of CPU. IoT and mobile apps with short connection lifetimes hit this first.
Connection count: each kept-alive connection costs file descriptors and ~10 KB of kernel memory. Chat and streaming apps hit this first.

Naming which ceiling is yours is the senior answer.

Take this with you. The LB breaks at three different scales for three different reasons: bandwidth, TLS CPU, and connection count. Which one hits first depends on your traffic shape, not the request volume.

The smallest version that works

One nginx. One backend. nginx terminates TLS and proxies plain HTTP to the backend.

flowchart LR
    U([Browser]):::user --> LB["nginx :443\n(TLS termination)"]:::edge
    LB --> B["Backend :8080"]:::app

    classDef user fill:#dbeafe,stroke:#1e40af,color:#1e3a8a
    classDef edge fill:#e2e8f0,stroke:#475569,color:#1e293b
    classDef app fill:#dcfce7,stroke:#15803d,color:#14532d

Two endpoints carry the whole product.

Endpoint	What it does
`GET *`	Proxy any HTTP request to the backend pool
`GET /healthz`	Return 200 if the LB itself is alive (for the layer above this one)

Show: the minimal nginx config

  
upstream api {
    server 10.0.1.10:8080;
}

server {
    listen 443 ssl;
    server_name api.example.com;

    ssl_certificate     /etc/ssl/api.crt;
    ssl_certificate_key /etc/ssl/api.key;

    location / {
        proxy_pass http://api;
        proxy_set_header Host            $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}

One upstream, one backend. Round-robin across one server is a no-op, but the config is ready for a second backend the day you need it.

Why put nginx in front from day one? Three reasons. You can swap the backend without changing DNS. You can add a second backend later without restructuring. TLS cert renewal does not require a backend deploy.

This is enough for a few hundred users. The interesting question is what breaks first as traffic grows.

Decision 1: L4 or L7?

Every load balancer decision flows from this one. L4 and L7 are not interchangeable choices with trade-offs. They are different products.

flowchart TB
    subgraph L4["L4 Load Balancer"]
        L4a["Sees: TCP/IP headers only\nsrc IP, dst IP, src port, dst port"]
        L4b["Routes: by IP and port only"]
        L4c["TLS: pass-through or terminate\n(cannot parse URLs without terminating)"]
        L4d["Latency: under 1 ms"]
        L4e["Use when: raw TCP, DDoS scrubbing,\nfast forwarding, database proxying"]
    end

    subgraph L7["L7 Load Balancer"]
        L7a["Sees: full HTTP request\nmethod, path, headers, cookies, body"]
        L7b["Routes: by path, host, cookie, header, query string"]
        L7c["TLS: always terminates\n(must read URLs to route)"]
        L7d["Latency: 1-3 ms"]
        L7e["Use when: internet-facing HTTP,\npath routing, stickiness, sticky sessions"]
    end

An algorithm comparison across three dimensions:

	Round-robin	Least connections	IP hash	Consistent hash
Works for HTTP/1.1	yes	yes (better default)	yes	yes
Works for HTTP/2	no (one conn = many requests)	no (same problem)	no	yes
Handles variable request time	poorly	well	poorly	depends on key
Provides stickiness	no	no	yes (fragile)	yes (stable)
Handles backend add/remove	fine	fine	reshuffles all clients	moves 1/N of clients

The practical defaults:

HTTP/1.1: least connections. Handles variable request times naturally.
HTTP/2: least active requests (count streams, not TCP connections). Envoy calls this LEAST_REQUEST.
Sharded state or cache pools: consistent hash. Adding a backend moves 1/N of keys.
Canary deploys: weighted round-robin. Set new version weight=1, old weight=9 for a 10% canary.

Take this with you. L4 is faster and cheaper. L7 is smart. Pick L7 for any HTTP-aware work (path routing, stickiness, per-route policy). Pick L4 at the outermost edge for DDoS scrubbing and raw TCP throughput.

Decision 2: how do we know a backend is healthy?

The startup grows. Multiple backends. On-call gets paged: backend B has been down for a minute, nginx kept sending traffic to it the whole time. There were no health checks.

A backend has a lifecycle once health checks are running.

stateDiagram-v2
    direction LR
    [*] --> Healthy: registered in pool
    Healthy --> Suspect: errors start
    Suspect --> Ejected: 3 failures in a row
    Suspect --> Healthy: errors stop
    Ejected --> HalfOpen: 30s cooldown
    HalfOpen --> Healthy: probe succeeds
    HalfOpen --> Ejected: probe fails (eject again, longer cooldown)
    Healthy --> [*]: backend removed

Active health checks (LB polls /healthz every 5 seconds) catch process death. They miss silent degradation: a backend returning 200 to /healthz while throwing 500s on real traffic, or responding in 30 seconds instead of 30 ms.

Passive outlier detection fills the gap. After each real request, the LB updates an error rate and a moving average latency per backend. If one backend’s p99 is more than 3x the pool median for 60 seconds, it is a candidate for ejection.

One parameter separates competent designs from dangerous ones: max_ejection_percent. If a bad deploy makes every backend return 500, an aggressive health checker kicks them all out. The site goes dark. Setting max_ejection_percent: 50 means the LB keeps sending traffic somewhere even in the worst case. A partial outage beats a total one.

flowchart LR
    subgraph Pool["Backend pool (4 backends)"]
        A["Backend A\nhealthy"]:::app
        B["Backend B\nhealthy"]:::app
        C["Backend C\nhealthy"]:::app
        D["Backend D\nreturning 500s"]:::bad
    end

    HC["Health checker\nevery 5s: GET /healthz\npassive: 500 rate per backend"] --> A
    HC --> B
    HC --> C
    HC -->|"3 failures → eject"| D
    D2["Backend D (ejected)\nmax_ejection_percent=50\nstill 3 of 4 healthy → ok to eject"]:::bad

    classDef app fill:#dcfce7,stroke:#15803d,color:#14532d
    classDef bad fill:#fecaca,stroke:#b91c1c,color:#7f1d1d

Take this with you. Active checks catch dead processes. Passive checks catch degraded ones. max_ejection_percent is the safety valve that prevents a total outage when the whole pool degrades at once.

Decision 3: sticky sessions or stateless backends?

Some backends need the same user to land on the same box every time. Authenticated sessions held in process memory, an in-progress checkout cart, an open WebSocket. How do you keep that user pinned without losing the LB’s ability to spread load?

Cookie-based stickiness: the LB reads a cookie. If it says srv=backend-B, the request goes to Backend B regardless of load.

The production problem is load skew.

flowchart LR
    Alice(["Alice\n30 req/min"]):::user -->|"cookie: srv=B"| LB["L7 LB"]:::edge
    Bob(["Bob\n2 req/min"]):::user -->|"cookie: srv=A"| LB
    Carol(["Carol\n5 req/min"]):::user -->|"cookie: srv=B"| LB
    Dave(["Dave\n40 req/min"]):::user -->|"cookie: srv=B"| LB

    LB --> A["Backend A\n2 req/min (idle)"]:::app
    LB --> B["Backend B\n75 req/min (hot)"]:::app

    classDef user fill:#dbeafe,stroke:#1e40af,color:#1e3a8a
    classDef edge fill:#e2e8f0,stroke:#475569,color:#1e293b
    classDef app fill:#dcfce7,stroke:#15803d,color:#14532d

Alice, Carol, and Dave all stick to Backend B. Bob sticks to A. B is overwhelmed. A is idle. The usual balancing algorithms cannot help: the sticky cookie overrides them.

Three fixes in order of preference for new systems:

Externalize the session. Store sessions in Redis. Backends become stateless. Any backend handles any request. The sticky cookie disappears. One extra network hop (~1-2 ms), zero load skew. This is the right answer for new systems.
Bounded-load stickiness. If the target backend is already above 1.25x average load, re-assign the user and issue a new cookie. Disrupts stickiness for the unlucky few but caps imbalance.
Cap session TTL. After 1 hour the cookie expires. The user gets re-assigned. Periodic rebalancing at the cost of brief disruption.

Show: sticky sessions across multiple LB instances

When you have more than one LB instance, each maintains its own in-memory sticky table. A user might hit LB-1 on their first request and LB-2 on the next. LB-2 has no record of their cookie.

The cleanest fix: encode the backend ID inside the cookie itself (signed and encrypted). Any LB instance decrypts it and routes correctly. No shared state between LB instances.

srv_id = base64(encrypt(backend_id="B", issued_at=..., expires=..., sig=...))

The LB decrypts on every request. Validates expiry. Routes. A tampered cookie fails decryption and the user is treated as new.

Take this with you. Stickiness is easy to enable and hard to operate. New systems should default to stateless backends with an external session store. The one extra Redis hop is cheaper than the operational cost of debugging load skew at 2 a.m.

Decision 4: TLS termination cost

At stage 3 (500 new connections per second), TLS handshakes consume roughly one full CPU core. At stage 4 (3,000/s), six cores. This is not a minor overhead.

Four knobs, cheapest first:

Change	Cost reduction	How
TLS 1.3 session tickets (0-RTT)	~10x for returning clients	Client reuses session from last connection. One trip instead of two.
ECDSA certs (P-256) instead of RSA-2048	~5x	Same security level, much cheaper math. Just a cert swap.
Session ticket cache tuning	Marginal	Make sure the cache is large enough that tickets do not expire before clients return.
Horizontal LB scaling	Linear with instances	Add more LB boxes. Works but costs money.

L4 at the outermost edge, L7 inside the region: the edge L4 terminates TCP and absorbs DDoS without parsing TLS. The L7 inside the region terminates TLS and does path routing. Each box does exactly one thing. A single box trying to do both is slower and harder to scale.

Take this with you. TLS 1.3 with session tickets plus ECDSA certs cuts handshake CPU by roughly 50x compared to RSA-2048 without session resumption. Do both before reaching for more hardware.

Decision 5: surviving attack traffic

Two attacks look different at different layers.

Slow-loris (L7 attack): an attacker opens many connections and sends HTTP headers slowly, one byte every 30 seconds. Each connection holds an nginx worker. With 10,000 of them, no workers are left for real requests.

Fix: client_header_timeout 10s. If a client does not complete its headers in 10 seconds, close the connection. This has no effect on real browsers, which send headers in under 100 ms.

SYN flood (L4 attack): an attacker sends millions of TCP SYN packets with spoofed source IPs. The server allocates a half-open connection slot for each and waits for the ACK that never comes. The SYN backlog fills. Real clients cannot connect.

Fix: SYN cookies. The server encodes connection state in the SYN-ACK reply instead of allocating memory. If the client never ACKs, no resources were wasted. Built into Linux (net.ipv4.tcp_syncookies=1). A dedicated L4 LB at the edge (AWS Shield, Cloudflare) scrubs SYN floods before they reach origin.

flowchart TB
    Internet(["Internet\nclients + attackers"]):::user --> Shield["Edge L4 LB\n(SYN cookies, DDoS scrub)"]:::edge
    Shield -->|"clean TCP connections only"| L7["L7 LB\n(TLS termination\nclient_header_timeout 10s)"]:::edge
    L7 --> Backends["Backend pool"]:::app

    classDef user fill:#dbeafe,stroke:#1e40af,color:#1e3a8a
    classDef edge fill:#e2e8f0,stroke:#475569,color:#1e293b
    classDef app fill:#dcfce7,stroke:#15803d,color:#14532d

Take this with you. SYN floods are an L4 problem. Slow-loris is an L7 problem. The mitigations sit at different layers and need different config. Mentioning both without prompting is a senior signal.

The full architecture

Pulling all five decisions together.

flowchart TB
    subgraph GlobalEdge["Global edge"]
        DNS["DNS + anycast IP"]:::edge
    end

    subgraph USEast["Region: us-east"]
        EL4["Edge L4 LB\n(SYN cookies, DDoS scrub)"]:::edge
        L7["L7 LB\n(TLS 1.3 + ECDSA\npath routing\nclient_header_timeout 10s)"]:::edge
        subgraph Pools["Service pools (100 backends total)"]
            API["Orders service\n(40 pods)"]:::app
            Auth["Auth service\n(20 pods)"]:::app
            Web["Web service\n(40 pods)"]:::app
        end
        Redis[("Redis\n(sessions)")]:::cache
        HC["Health checker\nactive: every 5s\npassive: per-request\nmax_ejection=50%"]:::app
    end

    subgraph EUWest["Region: eu-west"]
        EL4_EU["Edge L4 LB"]:::edge
        L7_EU["L7 LB"]:::edge
        Pools_EU["Service pools"]:::app
    end

    C([Browser]):::user --> DNS
    DNS -->|"anycast routes to nearest"| EL4
    DNS --> EL4_EU
    EL4 --> L7
    L7 -->|"/api/orders/*"| API
    L7 -->|"/api/auth/*"| Auth
    L7 -->|"/*"| Web
    API -.session check.-> Redis
    HC --> API
    HC --> Auth
    HC --> Web
    EL4_EU --> L7_EU
    L7_EU --> Pools_EU

    classDef user fill:#dbeafe,stroke:#1e40af,color:#1e3a8a
    classDef edge fill:#e2e8f0,stroke:#475569,color:#1e293b
    classDef app fill:#dcfce7,stroke:#15803d,color:#14532d
    classDef cache fill:#fecaca,stroke:#b91c1c,color:#7f1d1d

Each component in one line:

Component	Purpose
DNS + anycast	Routes each user to the nearest healthy region. BGP withdraw = sub-second regional failover.
Edge L4 LB	Terminates TCP, scrubs SYN floods, passes clean connections to L7. Does not parse HTTP.
L7 LB	Terminates TLS, reads URLs, routes by path, applies per-route policy, detects slow clients.
Health checker	Active polls every 5s, passive check per response, max 50% ejection.
Service pools	One pool per service. Each scales independently. Each has its own health check config.
Redis	Holds sessions so backends are stateless. Any backend handles any request.

Walk: a request, end to end

Alice opens the orders page.

sequenceDiagram
    autonumber
    participant Alice
    participant DNS
    participant EdgeL4 as Edge L4 (us-east)
    participant L7 as L7 LB (us-east)
    participant Orders as Orders backend

    Alice->>DNS: resolve api.example.com
    DNS-->>Alice: anycast IP (~1ms, cached)

    rect rgb(241, 245, 249)
        Note over Alice,L7: TCP + TLS setup (paid once per kept-alive connection)
        Alice->>EdgeL4: TCP SYN
        EdgeL4->>L7: 5-tuple hash picks one L7 instance
        Alice->>L7: TLS 1.3 handshake (1-RTT new, 0-RTT returning client ~1ms)
    end

    Alice->>L7: GET /api/orders/123
    L7->>L7: match /api/orders/*, pick backend (least connections, ~0.5ms)
    L7->>Orders: GET /api/orders/123 + X-Forwarded-For: Alice's IP
    Orders-->>L7: 200 OK + JSON (~20ms backend processing)
    L7-->>Alice: 200 OK + JSON
    Note over L7,Orders: in_flight counter decremented, EWMA updated

Latency budget:

Step	Typical
DNS (cached)	< 1 ms
Edge L4 forwarding	< 1 ms
TLS handshake (new)	10-30 ms
TLS resumption (0-RTT)	< 1 ms
L7 parse + route	< 1 ms
Backend processing	5-50 ms

The LB layers together add about 2-5 ms for a returning client. That is the price for path routing, observability, and health management.

The hard sub-problem: cascading failure

One backend is slow: disk latency has ballooned from 1 ms to 30 ms. Active health checks pass (it answers /healthz). Passive detection has not triggered yet.

Round-robin keeps sending it requests. Each takes 30 seconds. Workers pile up. After 100 workers are stuck on this backend, nginx runs out of workers and starts returning 502 to all clients, including the ones routed to healthy backends.

Without protection, this is the failure sequence:

flowchart TD
    A["Backend D: slow disk\nrequests take 30s"] --> B["Round-robin still sends 1/N of requests to D"]
    B --> C["Workers pile up waiting for D\n100 workers stuck, 0 left for other requests"]
    C --> D["LB returns 502 to ALL clients"]:::bad
    D --> E["Clients retry"]
    E --> F["Retries hit other backends\nthey also slow down"]:::bad
    F --> G["All backends ejected or overloaded\n503 everywhere"]:::bad

    classDef bad fill:#fecaca,stroke:#b91c1c,color:#7f1d1d

The defenses, in order of application:

Least connections instead of round-robin. Backend D accumulates in-flight requests. Once it has more than the others, new requests go elsewhere. The skew corrects automatically.
proxy_read_timeout 30s. After 30 seconds the LB gives up, frees the worker, and records a failure against Backend D. The backend’s failure count rises toward ejection threshold.
Passive latency ejection. If Backend D’s p99 is more than 3x the pool median for 60 seconds, mark it as suspect. Combine with max_ejection_percent: 50.
Backend-side circuit breaker. When Backend D’s own dependency (slow disk, slow DB) is sick, the backend should return 503 immediately instead of queuing for 30 seconds. The LB ejects it on the 503 and Backend D recovers faster.
Client retries with backoff. Retries are inevitable. Without jitter, every client retries at the same second. With exponential backoff and jitter, retries spread out over tens of seconds.

Take this with you. Cascading failure is the most common LB production incident. The fix is: least connections + read timeout + passive ejection + max_ejection_percent. All four are needed. Any one alone is not enough.

Follow-up questions

Try answering each in 2-3 sentences before opening the solution.

Sticky sessions and uneven load. You enable cookie-based stickiness. Three power users cluster on Backend B. It runs hot while A and C idle. How do you fix this without losing stickiness?
TLS termination cost. Your LB’s CPU sits at 80% and you trace it to TLS handshakes. What are your options, cheapest first?
HTTP/2 and least connections. You switch backends to HTTP/2. Suddenly almost all traffic goes to one backend. Why, and what do you change?
WebSockets. You add a WebSocket feature. Each user opens one long-lived connection. After deploying, the load is wildly uneven for hours. What happened?
Slow backend starving the pool. One backend has a slow disk. Requests there take 30 seconds instead of 30 ms. Round-robin keeps sending it requests. nginx workers pile up. What algorithm or config fixes this?
DNS TTL. You set DNS TTL to 1 hour. Your LB IP changes during an emergency. Clients still hit the old IP for an hour. What is the right TTL? What is the trade-off?
Cross-region failover. Your us-east region is down. How does traffic get to eu-west? How long does it take? Walk through each layer.
Path-based routing for a monolith split. You are splitting a monolith. /api/orders/* should go to a new order-service. Everything else stays on the monolith. What changes in the LB? How do you migrate without breaking clients?
Health check storm. 200 LB instances each polling 500 backends every 5 seconds = 20,000 /healthz requests per second. How do you cut this down without losing health visibility?
LB dropping connections during deploy. New backends register before they are ready. Old backends are killed mid-request. What is the right deploy sequence?

Distributed Cache (009). Consistent hashing was introduced here. Cache pools use the same ring algorithm to distribute keys across nodes.
Read-Heavy System Patterns (017). The LB sits at the center of the read scaling story. Same algorithms, different traffic shape.
Write-Heavy System Patterns (018). Puts the LB in front of the write path, where stickiness choices affect how partitioning behaves.

Try the problem on your own first. Solutions are most valuable after you've struggled with it.

Solution: Design a Load Balancer

The short version

A load balancer is a proxy that picks a backend for each incoming connection, keeps the backend list honest by removing unhealthy servers within seconds, and distributes traffic so no single backend is overwhelmed. TLS terminates at the LB so backends speak plain HTTP. The LB itself needs to survive the failure of one of its own instances.

The picking algorithm is the easy part. The hard work is health accuracy (an active /healthz check does not detect a backend that answers it while throwing 500s on real traffic), load skew from sticky sessions, TLS handshake CPU at scale, and cascading failure when one slow backend blocks all workers.

The right topology is layered. An L4 edge terminates TCP and scrubs DDoS. An L7 layer inside each region terminates TLS and does path-based routing. DNS or anycast directs clients to the nearest region. Each layer adds 1-3 ms in exchange for one specific control point.

1. The two questions that matter most

Protocol shape. HTTP/2 with multiplexed connections is a completely different load picture from HTTP/1.1 with short connections. WebSockets change it again. The algorithm that works for HTTP/1.1 (least connections) fails silently for HTTP/2 because one TCP connection carries many requests. The algorithm that works for short requests (round-robin) fails for WebSockets because connections live for hours and round-robin only spreads the connects evenly, not the sustained load.

Stickiness requirement. If backends are stateless (sessions in Redis, not memory), every balancing algorithm is available. If backends hold state in memory (shopping carts, WebSocket handles, in-memory caches), every algorithm has a tax: sticky algorithms cause load skew; stateless algorithms cost a network hop per request for session data.

Everything else follows from those two answers.

2. The math, in plain numbers

Stage	Peak req/s	Bandwidth	New TLS/s	Concurrent conns	What hurts
1	20	~10 Mbps	~2	~100	nothing
2	500	~240 Mbps	~50	~2,500	NIC and nginx workers
3	5,000	~2.4 Gbps	~500	~25,000	TLS CPU starts to bite
4	30,000	~5-15 Gbps	~3,000	~200k+	TLS dominates; need LB cluster

Three ceilings:

Bandwidth. A 10 Gbps NIC caps at ~9 Gbps usable. Video and file-heavy workloads hit this before TLS becomes a problem.

TLS handshakes. Each costs 1-3 ms of CPU. At 500/s, roughly one core. At 3,000/s, six cores doing nothing but handshakes. TLS 1.3 with session tickets (0-RTT for returning clients) cuts this by ~10x. ECDSA-P256 certs instead of RSA-2048 cut it by another ~5x. Both together: ~50x reduction, no hardware change.

Connection count. Each kept-alive connection costs file descriptors and ~10 KB of kernel memory. 200k concurrent connections needs Linux kernel tuning (ulimit, net.core.somaxconn) and several GB of RAM just for the socket table.

Which ceiling hits first depends on your workload. Name which one is yours.

3. The control plane API

A load balancer is not a REST API. It is a TCP/HTTP proxy. But every LB needs a control plane to manage its config at runtime, especially for deploys.

GET    /admin/v1/pools                         list all backend pools
GET    /admin/v1/pools/{pool}/backends         list backends with health status
POST   /admin/v1/pools/{pool}/backends         add a backend
DELETE /admin/v1/pools/{pool}/backends/{id}    remove a backend
PUT    /admin/v1/pools/{pool}/backends/{id}    change weight or state

POST   /admin/v1/pools/{pool}/backends/{id}/drain   stop new connections, let in-flight finish
POST   /admin/v1/pools/{pool}/backends/{id}/ready   re-enable after drain or maintenance

GET    /admin/v1/stats                         request rate, errors, latency histograms

The drain endpoint is load-bearing for deploys. Before killing a backend process, call drain. The LB stops sending new connections. In-flight requests finish. After a 30-second grace period, kill the process. Skip this and you drop real requests.

4. The data model

The LB is stateless for routing. All its operational state lives in RAM. There is no database on the hot path.

BackendPool {
    name: string
    algorithm: round_robin | least_conn | ip_hash |
               consistent_hash | ewma | weighted_rr
    health_check: HealthCheckConfig
    backends: List<Backend>
}

Backend {
    id: string                        # "10.0.1.10:8080"
    address: ip + port
    weight: int
    state: healthy | suspect | ejected | draining
    consecutive_failures: int
    consecutive_successes: int
    in_flight_requests: atomic int    # for least_conn / LEAST_REQUEST
    ewma_latency_ms: float            # for EWMA
    last_check_at: timestamp
}

HealthCheckConfig {
    type: http | tcp | grpc
    path: string                      # "/healthz"
    interval_ms: int                  # 5000
    timeout_ms: int                   # 1000
    unhealthy_threshold: int          # 3 consecutive fails = eject
    healthy_threshold: int            # 2 consecutive successes = re-add
    expected_status: List<int>        # [200]
    max_ejection_percent: int         # 50: never eject more than half the pool
}

Three things to defend out loud:

All of this fits in RAM. A pool of 1,000 backends with full state is well under 1 MB. No database, no Zookeeper, no Consul required on the hot path.

Each LB instance has its own independent view of health. If LB-1’s network path to Backend B is broken but LB-2’s path is fine, each should route to whatever it can reach. A shared global “B is unhealthy” opinion would turn a local network partition into a global outage.

Sticky sessions across multiple LB instances require encoding the backend ID inside the cookie itself (signed, encrypted). Any instance decrypts it and routes correctly. No shared state between LB instances.

5. The core engine

stateDiagram-v2
    direction LR
    [*] --> Accept: TCP connection
    Accept --> MatchPool: host or port lookup
    MatchPool --> PickBackend: run algorithm
    PickBackend --> Forward: send request
    Forward --> RecordOutcome: update EWMA, in_flight counter
    RecordOutcome --> [*]: response sent
    Forward --> RetryNext: backend error
    RetryNext --> PickBackend: if retries < max
    RetryNext --> Return502: retries exhausted
    Return502 --> [*]

Show: the pick and forward loop in pseudo-code

  
def handle_request(conn):
    pool = match_pool(conn.host, conn.port)
    retries = 0

    while retries < pool.max_retries:
        backend = pick_backend(pool, conn)
        if backend is None:
            return respond_503(conn, "no healthy backends")

        backend.in_flight.increment()
        try:
            response = forward(conn, backend)
            backend.in_flight.decrement()
            record_outcome(backend, response.status, response.latency_ms)
            return respond(conn, response)
        except BackendError as e:
            backend.in_flight.decrement()
            record_failure(backend, e)
            retries += 1

    return respond_502(conn, "backend exhausted after retries")


def pick_backend(pool, conn):
    healthy = [b for b in pool.backends if b.state == "healthy"]
    if not healthy:
        return None

    if pool.algorithm == "least_conn":
        return min(healthy, key=lambda b: b.in_flight.get())

    elif pool.algorithm == "round_robin":
        return healthy[pool.rr_counter.next() % len(healthy)]

    elif pool.algorithm == "consistent_hash":
        return pool.hash_ring.lookup(conn.routing_key)

    elif pool.algorithm == "ewma":
        return min(healthy, key=lambda b: b.ewma_latency_ms)

The health check loop runs in parallel. It polls /healthz on each backend, updates consecutive failure/success counters, and flips state when thresholds cross. Passive detection runs on every real response: update the EWMA, check if p99 exceeds 3x pool median, eject if so (but never more than max_ejection_percent).

Three things that make this safe in production:

Pick-and-forward tolerates a backend dying between health check intervals. If forwarding raises an error, the LB picks the next backend. Retries are capped so a fully broken pool fails fast rather than hanging forever.

Active and passive checks are complementary. Active catches process death. Passive catches silent degradation: a backend answering 200 to /healthz while returning 500 on real traffic, or responding slowly.

max_ejection_percent is the difference between a partial outage and a total one. The outlier logic refuses to eject if doing so would drop the pool below the floor. Better to keep sending traffic to a sick backend than to send it nowhere.

6. L4 vs L7, resolved

flowchart LR
    subgraph L4["L4 LB"]
        A["Sees: IP + port only"]
        B["Routes: by IP/port"]
        C["TLS: pass-through or terminate"]
        D["Latency: < 1 ms"]
        E["Use: DDoS scrub, raw TCP, database proxying"]
    end

    subgraph L7["L7 LB"]
        F["Sees: full HTTP (method, path, headers, cookies)"]
        G["Routes: path, host, cookie, header, query"]
        H["TLS: always terminates"]
        I["Latency: 1-3 ms"]
        J["Use: internet HTTP, path routing, stickiness"]
    end

Production pattern: L4 at the outermost edge, L7 inside each region.

The edge L4 terminates TCP, absorbs DDoS via SYN cookies and rate limiting, and passes clean connections inward. It does not parse HTTP, which is why it is fast. The L7 inside the region terminates TLS, reads URLs, routes to per-service pools, and injects headers. Each layer does exactly one job. A single box trying to do both is slower and harder to scale independently.

7. The architecture

flowchart TB
    subgraph GlobalEdge["Global edge"]
        DNS["DNS + anycast IP\n(Route 53, Cloudflare)"]:::edge
    end

    subgraph USEast["Region: us-east"]
        EL4["Edge L4 LB\n(SYN cookies, DDoS scrub)"]:::edge
        L7["L7 LB\n(TLS 1.3 + ECDSA\npath routing\nclient_header_timeout 10s)"]:::edge
        subgraph Pools["Service pools"]
            API["Orders service\n(N pods)"]:::app
            Auth["Auth service\n(N pods)"]:::app
            Web["Web service\n(N pods)"]:::app
        end
        Redis[("Redis\n(sessions)")]:::cache
        HC["Health checker\nactive 5s + passive\nmax_ejection=50%"]:::app
    end

    subgraph EUWest["Region: eu-west"]
        EL4_EU["Edge L4 LB"]:::edge
        L7_EU["L7 LB"]:::edge
        Pools_EU["Service pools"]:::app
    end

    C([Browser]):::user --> DNS
    DNS -->|"anycast"| EL4
    DNS --> EL4_EU
    EL4 --> L7
    L7 -->|"/api/orders/*"| API
    L7 -->|"/api/auth/*"| Auth
    L7 -->|"/*"| Web
    API -.session.-> Redis
    HC --> API
    HC --> Auth
    HC --> Web
    EL4_EU --> L7_EU
    L7_EU --> Pools_EU

    classDef user fill:#dbeafe,stroke:#1e40af,color:#1e3a8a
    classDef edge fill:#e2e8f0,stroke:#475569,color:#1e293b
    classDef app fill:#dcfce7,stroke:#15803d,color:#14532d
    classDef cache fill:#fecaca,stroke:#b91c1c,color:#7f1d1d

Component	Purpose	Why it exists
DNS + anycast	Routes each user to the nearest healthy region	BGP withdraw = sub-second regional failover
Edge L4 LB	Terminates TCP, absorbs DDoS	Fast; does not parse HTTP; scales to Tbps
L7 LB	Terminates TLS, routes by path, injects headers	Smart routing, per-route policy, observability
Health checker	Active + passive, max 50% ejection	Detects dead and degraded backends
Service pools	One pool per service	Each service scales independently
Redis	Holds sessions	Lets backends be stateless

Take this with you. If the auth service dies, orders and web keep running. Each pool is independent. The LB is the only component that sees all of them.

8. A request, end to end

sequenceDiagram
    autonumber
    participant Alice
    participant DNS
    participant EdgeL4 as Edge L4 (us-east)
    participant L7 as L7 LB (us-east)
    participant Orders as Orders backend

    Alice->>DNS: resolve api.example.com
    DNS-->>Alice: anycast IP (BGP routes to us-east, ~1ms cached)

    rect rgb(241, 245, 249)
        Note over Alice,L7: TCP + TLS setup (paid once per kept-alive connection)
        Alice->>EdgeL4: TCP SYN
        EdgeL4->>L7: 5-tuple hash picks one L7 instance
        Alice->>L7: TLS 1.3 handshake (1-RTT new client, 0-RTT returning ~1ms)
    end

    Alice->>L7: GET /api/orders/123
    L7->>L7: match /api/orders/*, pick backend (least connections, ~0.5ms)
    L7->>Orders: GET /api/orders/123 + X-Forwarded-For: Alice's IP
    Orders-->>L7: 200 OK + JSON (~20ms)
    L7-->>Alice: 200 OK + JSON
    Note over L7,Orders: in_flight decremented, EWMA updated

Latency budget:

Step	Typical
DNS (cached)	< 1 ms
Edge L4 forwarding	< 1 ms
TLS handshake (new client)	10-30 ms
TLS resumption (0-RTT)	< 1 ms
L7 parse + route	< 1 ms
Backend processing	5-50 ms

The LB layers together add ~2-5 ms for a returning client.

9. The scaling journey: 100 users to 1 million

flowchart LR
    S1["Stage 1\n100 users\n1 nginx + 1 backend\n~$5/mo"]:::s1
    S2["Stage 2\n~10k users\n+ real health checks\n+ 3-5 backends\n~$50-100/mo"]:::s2
    S3["Stage 3\n~100k users\n+ two LB instances\n+ path routing\n~$1-5k/mo"]:::s3
    S4["Stage 4\n1M users\n+ anycast global LB\n+ per-region L7\n~$10-100k/mo"]:::s4

    S1 --> S2 --> S3 --> S4

    classDef s1 fill:#e0f2fe,stroke:#0369a1,color:#0c4a6e
    classDef s2 fill:#dcfce7,stroke:#15803d,color:#14532d
    classDef s3 fill:#fef3c7,stroke:#a16207,color:#713f12
    classDef s4 fill:#fce7f3,stroke:#be185d,color:#831843

Stage 1: 100 users

One nginx, one backend. nginx terminates TLS and proxies HTTP. Round-robin across one backend is a no-op, but the config is ready for a second. No health checker (one backend is always up or the site is down for 100 users). No failover.

Stage 2: 10,000 users, 3-5 backends

What broke: the single backend pegs CPU during bursts. Killing it for deploys takes the site down for 30 seconds.

Scale to 3-5 backends. Switch from round-robin to least_conn.
Active health checks every 5 seconds. Eject after 3 failures, re-add after 2 successes.
proxy_next_upstream so the LB retries on a different backend when one fails mid-request.
Drain before killing for deploys.

Not built yet: no second LB instance. A single nginx is still a SPOF. For 10,000 users a rare 30-second failover is acceptable.

Stage 3: 100,000 users, active-active LB, path routing

What broke (several things at once):

TLS CPU on the single nginx climbs. 500 new connections/s x 2 ms = one core just on handshakes.
The single nginx is a SPOF that cannot be accepted.
Monolith is splitting; nginx config is getting unmanageable.
Add a second nginx in active-active with a shared VIP (keepalived/VRRP). ~1-second failover. Or use a managed cloud L7 LB (AWS ALB) which is multi-AZ by default.
Path-based routing: /api/orders/* to orders service, etc.
TLS 1.3 + session tickets. ECDSA-P256 certs. Cut handshake CPU by ~50x.
Move sessions to Redis. Backends become stateless. No sticky sessions needed.

Stage 4: 1,000,000 users, global + regional

What broke:

Users in Asia see 300 ms to us-east.
TLS at 3,000/s costs 6 cores even with session tickets.
Service-to-service traffic crosses the central LB twice per call.
Anycast global LB (Cloudflare, AWS Global Accelerator). Same IP from every region. BGP routes each user to nearest healthy region. Failover in seconds.
Sidecar proxy (Envoy, Linkerd) on every pod. Internal calls skip the central LB. Adds mTLS, retries, and per-call observability.
TLS 1.3 + 0-RTT everywhere. ECDSA everywhere.
Slow start: ramp new backend weight from 0 to 100% over 30 seconds. Cold caches do not get slammed.

10. Reliability

The LB itself dies. Active-passive VIP with keepalived: ~1-second failover, same IP. Active-active anycast: BGP withdraw shifts traffic sub-second. Anycast wins for internet-facing services at scale.

A backend dies. Active health checks notice within 1-3 intervals (5-15 seconds). In-flight requests on the dead backend fail and are retried on a healthy backend via proxy_next_upstream.

The whole pool degrades. max_ejection_percent: 50 stops the LB from ejecting everyone. It returns 503 while someone investigates. Better to tell the truth than forward to nowhere.

A backend is alive but slow. The dangerous case. Active health checks pass. Real requests take 30 seconds. Round-robin keeps sending more.

Fix: least_conn stops sending new requests to the slow backend once it accumulates in-flight. Combine with proxy_read_timeout 30s. Add passive latency ejection: if p99 is more than 3x the pool median for 60 seconds, eject temporarily. Backend-side circuit breaker: when the backend’s own dependency is sick, it should return 503 fast rather than serve slowly.

Cascading failure. Backend A is slow. LB ejects it. Traffic shifts to B and C. They hit 1.5x load and start timing out. LB ejects them too. All backends gone.

The fixes together: max_ejection_percent: 50 (never eject everyone); backend circuit breakers (return 503 fast when overloaded); client retries with exponential backoff and jitter (spread retry bursts over tens of seconds instead of one second).

TLS cert expiry. A cert expiring silently takes the whole service down. Automate renewal (Let’s Encrypt + cert-manager). Alert on certs expiring within 30 days.

11. Observability

Metric	Why it matters
`lb.request.rate` per backend	Uneven values mean the balancing algorithm is not working
`lb.request.error_rate` per backend	One backend spiking = candidate for ejection
`lb.request.latency` p50/p95/p99 per backend	Find slow backends before they cascade
`lb.healthy_backend_count` per pool	Below 50% = page someone
`lb.ejected_backend_count` per pool	Spike = something pool-wide is wrong
`lb.tls.handshakes_per_sec`	TLS CPU ceiling watch
`lb.tls.session_resumption_rate`	Below 70% means tickets are not working
`lb.connection.active_count`	File descriptor and memory pressure
`lb.connection.new_per_sec`	Much lower than request rate means keep-alive is working
`lb.upstream.retry_rate`	High = flaky backends; the LB is masking the problem
`lb.bandwidth.in_out`	NIC saturation watch

Page on: healthy_backend_count < 50% for 1 minute. LB instance unresponsive. TLS error rate > 1%.

Ticket on: latency p99 regression > 30% sustained. ejected_backend_count > 0 for 10 minutes. Bandwidth > 70% of NIC capacity.

12. Follow-up answers

1. Sticky sessions and uneven load.

Three options in order of how much stickiness you give up. First: cap session TTL at 1 hour. Users re-assign periodically. Brief disruption, periodic rebalancing. Second: bounded-load stickiness. If the target backend is at > 1.25x average load, re-assign and re-issue the cookie. Disrupts a few users but caps the imbalance. Third: externalize sessions to Redis. Backends become stateless. Stickiness goes away entirely. Right answer for new systems. Legacy systems often cannot pay the rewrite cost.

2. TLS termination cost, cheapest first.

Enable TLS 1.3 with session tickets (returning clients do 0-RTT, ~10x cheaper; just a config change). Switch to ECDSA-P256 certs instead of RSA-2048 (~5x faster at the same security level; just a cert swap). Tune the session ticket cache size. Scale the LB horizontally. Hardware TLS offload (worth it only above ~10k handshakes/s). Most teams stop at step 2.

3. HTTP/2 and least connections.

With HTTP/2, each client opens one TCP connection and multiplexes many requests over it. least_conn sees every backend has 1 connection. The first backend wins ties consistently and gets all new traffic. Fix: switch to least active requests. Envoy calls this LEAST_REQUEST. It counts in-flight HTTP/2 streams per backend, not TCP connections. If the LB is L4 and cannot parse HTTP/2, move to an L7 LB.

4. WebSockets and uneven distribution.

WebSocket connect is one round trip. After that the LB shuffles bytes. With 10k users and 5 backends on round-robin, the initial spread is even (2k per backend). As users disconnect and reconnect over hours, the spread drifts. If a backend bounces, all its connections reconnect and round-robin slams the next backend in rotation. Fixes: least_conn for the connect step; graceful drain on bounce (close connections in waves over 2 minutes, not all at once); cap max connections per backend; consider a dedicated WebSocket gateway.

5. Slow backend starving the pool.

least_conn is the primary fix. The slow backend accumulates in-flight requests. Once it has more than the others, new requests go elsewhere. Combine with proxy_read_timeout 30s: after 30 seconds the LB gives up and frees the worker. Add passive latency ejection: if p99 is more than 3x pool median for 60 seconds, eject temporarily. Backend-side circuit breaker: when the backend’s own dependency (slow disk, slow DB) is sick, it should return 503 immediately. The LB ejects it and the backend recovers faster.

6. DNS TTL.

Long TTL: fewer DNS queries, slower to react to LB IP changes. Short TTL: faster failover, more DNS load. Production default: 60 seconds. Worst-case failover is 1-2 minutes. For a planned LB IP change: drop TTL to 30s 24 hours ahead so resolver caches have short TTLs by the time you flip. The real fix is anycast: the IP never changes, BGP advertisement just stops from the failed region.

7. Cross-region failover.

With anycast: BGP advertisement from the failed region is withdrawn. Routers learn within seconds. Clients are silently routed to the next-nearest region. Sub-second on good networks, up to 30 seconds on slow-converging paths. With geo-DNS only: Route 53 health checks detect the failure and stop returning the failed region’s IP. Bound by DNS TTL (60 seconds). Clients with cached DNS still hit the dead region for up to TTL seconds. Anycast failover is ~10x faster.

8. Path-based routing for a monolith split.

Add a route to the L7 LB: /api/orders/* to order_service, everything else to monolith. More specific paths first. Migration: deploy order_service in parallel with the monolith. Route 5% of /api/orders/* traffic to the new service using weighted routing. Monitor. Raise to 25%, then 50%, then 100%. Remove order-handling from the monolith. Rollback is a single LB config change.

9. Health check storm.

200 LBs × 500 backends / 5s = 20,000 /healthz requests per second. If checks are deep (DB query), you DoS your own DB. Two fixes: keep /healthz shallow (is the process alive, 200); run deep checks (DB connectivity) separately at 60-second intervals through monitoring, not the LB. Better: switch from pull to push. Envoy subscribes to a service discovery system (xDS) that pushes health updates. The discovery system checks each backend once, not 200 times. In Kubernetes: kubelet runs liveness and readiness probes once per node, and the LB watches Endpoints. Either approach cuts health check load by 100-200x.

10. LB dropping connections during deploy.

Two failure modes: new backends register before they finish warming up (cold cache errors on real requests); old backends are killed mid-request (dropped connections). Correct sequence: new backend starts. Readiness probe /ready returns 503 until initialization completes. LB does not include it until /ready is 200. LB slow-starts the new backend: ramp weight from 0 to 100% over 30 seconds. Old backend drain: mark as draining, stop new connections, let in-flight finish, kill after 30-second grace period. Roll one backend at a time. Never kill more than one simultaneously. Kubernetes handles most of this with readiness probes, preStop hooks, and terminationGracePeriodSeconds.

13. Trade-offs worth saying out loud

Hardware LB vs software LB vs cloud-managed. Hardware (F5, NetScaler): high upfront cost, very high throughput per unit, complex to operate. Common in regulated enterprises. Software (nginx, HAProxy, Envoy): commodity hardware, open source, full control, you operate it. Default for most engineering teams. Cloud-managed (ALB, GCLB): zero operations, pay per request and per GB, less flexibility. Right for cloud-native teams below a certain bandwidth bill. Above tens of TB/month outbound, self-managed often gets cheaper.

Sidecar mesh vs centralized LB. Centralized: one LB tier in front of all services. Easy to reason about, single config point, single failure point. Sidecar: every pod has a proxy, routing decisions are local, full per-call observability, ~50 MB overhead per pod. Sidecar wins for service-to-service traffic inside the cluster. Centralized wins for ingress from the internet. Most large systems use both.

Sticky sessions vs stateless backends. Stickiness is operationally simpler (no shared state to operate). Stateless is operationally cleaner (any algorithm works, any backend serves any request, no load skew). New systems should default to stateless. Legacy systems are often stuck with stickiness.

14. Common mistakes

Treating “load balancer” as a single thing. The LB is layered. Saying “we use nginx” without acknowledging the DNS/anycast layer in front and the sidecar layer behind shows you have not thought about topology.

No mention of L4 vs L7. The difference is the most basic concept question on this topic. Skip it and the interviewer moves on.

Round-robin everywhere. Round-robin is the wrong default for variable-duration requests, HTTP/2, and WebSockets. least_conn (or least active requests for HTTP/2) is a better default.

Ignoring sticky session costs. “We’ll use cookie-based stickiness” without discussing load skew and the backend-failover reshuffle is a junior answer.

No mention of health check storms. At non-trivial scale, 200 LB instances polling 500 backends becomes its own DoS vector. Push-based health via xDS or Kubernetes Endpoints is the senior answer.

Forgetting the LB is a SPOF. Active-active anycast or active-passive VIP. Either works. Not addressing it is not.

TLS handwave. “Terminate at the LB” is correct but incomplete. Mention session tickets, ECDSA certs, 0-RTT, and at scale, when hardware offload is worth considering.

Cascading failure not mentioned. This is the most common LB production incident. max_ejection_percent and backend-side circuit breakers are the answer. Candidates who name both without prompting are at senior level.

No drain step in deploys. Rolling deploys without drain drop requests. A senior candidate names readiness probes, slow start, and drain without being asked.

Treating sidecar mesh as a buzzword. If you mention service mesh, explain why it removes the central LB hop for internal traffic and gives per-service mTLS. Not just that you would use it.

Seven of these ten without prompting is a senior-level answer. The three that matter most: L4/L7 framing, sticky session costs, and health check storms.