Design a Rate Limiter
What we are building
A rate limiter sits in front of an API and enforces per-client request quotas. A SaaS company might give free-tier users 100 requests per minute, pro users 10,000 per minute, and enterprise users 100,000 per minute. The API gateway runs on 200 nodes. Every one of those nodes must enforce the same limit for the same client. When Alice sends request number 101 in the same minute, any node she hits must reject it.
That is the whole product. The implementation is where things get interesting.
There are five real problems hiding in this design:
- Shared counter. With 200 nodes each keeping their own count, Alice can send 100 requests to each and pass 20,000 through. The counter must be shared.
- Fail-open vs fail-closed. When the counter store goes down, do you let requests through or block them? The answer depends on the route, not the system.
- Burst handling. “100 requests per minute” does not say anything about the first second. Alice could send all 100 in the first millisecond. Most algorithms handle this differently.
- Latency budget. The rate check must cost under 2ms at P99. Every design choice is constrained by that number.
- Hot tenant. One enterprise client sends 80% of the total traffic. Their counter is hit 240,000 times per minute. The counter store must not become the bottleneck for everyone else.
We will start with the simplest version that works, then add one layer at a time as each problem appears.
The lifecycle of one request
Every incoming request goes through the same short path.
stateDiagram-v2
direction LR
[*] --> Arrives: request comes in
Arrives --> CheckCount: look up counter
CheckCount --> Allowed: count < limit
CheckCount --> Rejected: count >= limit
Allowed --> Incremented: bump counter
Incremented --> [*]
Rejected --> [*]
Everything we add later (shared Redis, local caches, fail-open modes, burst allowances) is a complication on top of this single flow.
Take this with you. A rate limiter is a counter lookup, an increment, and a yes-or-no decision. All the engineering is in making those three steps fast, shared, and fault-tolerant.
How big this gets
Assume the following numbers.
| Input | Value |
|---|---|
| Peak QPS | 300,000 |
| Distinct API keys | 100,000 |
| Free tier limit | 100 req/min |
| Pro tier limit | 1,000 req/min |
| Enterprise tier limit | 10,000 req/min |
| Gateway nodes | 200 |
| Latency budget for the check | 2ms at P99 |
Show: what the numbers mean
Counter writes per second. Every request bumps a counter. 300K QPS at peak means 300K writes per second to start. Keep a few counters per client (per-minute, per-hour, per-route) and it becomes roughly 1 million Redis ops per second at peak.
Memory for all the counters. 100K clients x 3 counters each x 100 bytes per counter = 30 MB. Tiny. Fits on one Redis node. We shard for availability, not capacity.
Per-node QPS. 300K / 200 nodes = 1,500 checks per second per gateway. Easy in-process. The risk is not CPU. It is the number of Redis round trips.
Redis ops if every check is a round trip. 300K req/sec x 2 ops per check (read + conditional increment) = 600K ops/sec. One Redis instance handles about 100K ops/sec comfortably. So we either shard, or cut the number of round trips, or both.
The state is small. The challenge is how often we touch the store, not how much we store.
Take this with you. State is not the problem. Round trips are. Every design decision that follows is about cutting the number of Redis calls.
The smallest version that works
Forget 200 gateway nodes. One gateway. One Redis. One table of limits.
flowchart LR
C([Alice's app]):::user --> GW["Gateway<br/>(check + increment)"]:::app
GW --> R[("Redis<br/>counter")]:::cache
GW --> API["API service"]:::app
classDef user fill:#dbeafe,stroke:#1e40af,color:#1e3a8a
classDef app fill:#dcfce7,stroke:#15803d,color:#14532d
classDef cache fill:#fecaca,stroke:#b91c1c,color:#7f1d1d
sequenceDiagram
autonumber
participant Alice as Alice's app
participant GW as Gateway
participant R as Redis
participant API as API service
Alice->>GW: GET /api/search
GW->>R: GET counter for Alice's key
R-->>GW: count = 47
alt count < limit
GW->>R: INCR counter
GW->>API: forward request
API-->>GW: 200 OK
GW-->>Alice: 200 + X-RateLimit-Remaining: 52
else count >= limit
GW-->>Alice: 429 Too Many Requests, Retry-After: 28
end
That is the whole product. Everything we add from here fixes a specific problem.
Take this with you. Start from this. The interesting part of the interview is what breaks the moment you add the second gateway node.
Decision 1: where does the counter live?
The second gateway node breaks everything. Each node has its own in-memory counter. Alice sends 100 requests: 50 land on node 1, 50 land on node 2. Both see 50. Neither blocks her. With 200 nodes, Alice can pass 20,000 through a 100-request limit.
A counter that is not shared is not a rate limiter.
flowchart TB
Alice([Alice]):::user --> LB["Load Balancer"]:::edge
LB --> GW1["Gateway 1<br/>(count: 50)"]:::app
LB --> GW2["Gateway 2<br/>(count: 50)"]:::app
GW1 --> Prob["Problem: 50 + 50 = 100<br/>Neither node blocks her"]:::bad
GW2 --> Prob
classDef user fill:#dbeafe,stroke:#1e40af,color:#1e3a8a
classDef edge fill:#e2e8f0,stroke:#475569,color:#1e293b
classDef app fill:#dcfce7,stroke:#15803d,color:#14532d
classDef bad fill:#fecaca,stroke:#b91c1c,color:#7f1d1d
Move the counter to Redis. All 200 gateways share one counter per client. The math stays honest.
But now every request makes a Redis call. At 300K QPS that is 300K Redis ops per second. One Redis node handles about 100K comfortably. We need sharding.
flowchart TB
GW["Gateway (200 nodes)"]:::app --> R1[("Redis shard 1<br/>keys a-f")]:::cache
GW --> R2[("Redis shard 2<br/>keys g-m")]:::cache
GW --> R3[("Redis shard 3<br/>keys n-z")]:::cache
classDef app fill:#dcfce7,stroke:#15803d,color:#14532d
classDef cache fill:#fecaca,stroke:#b91c1c,color:#7f1d1d
One client’s counters always land on the same shard. The atomic Lua script (which reads current-window and previous-window keys together) works because both keys are on the same shard.
Show: the atomic check-and-increment Lua script
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
-- KEYS[1] = current window key e.g. "rl:apikey:sk_xyz:search:1716381660"
-- KEYS[2] = previous window key e.g. "rl:apikey:sk_xyz:search:1716381600"
-- ARGV[1] = limit
-- ARGV[2] = window_seconds
-- ARGV[3] = now_ms
-- ARGV[4] = cost (default 1)
-- Returns: { allowed (0|1), remaining, retry_after_ms }
local limit = tonumber(ARGV[1])
local window_seconds = tonumber(ARGV[2])
local now_ms = tonumber(ARGV[3])
local cost = tonumber(ARGV[4])
local current = tonumber(redis.call('GET', KEYS[1])) or 0
local previous = tonumber(redis.call('GET', KEYS[2])) or 0
local window_ms = window_seconds * 1000
local elapsed_ms = now_ms % window_ms
local prev_weight = (window_ms - elapsed_ms) / window_ms
local estimated = current + math.floor(previous * prev_weight)
if estimated + cost > limit then
local need_to_shed = (estimated + cost) - limit
local retry_after_ms = math.ceil((need_to_shed / math.max(previous, 1)) * window_ms)
return { 0, 0, retry_after_ms }
end
local new_current = redis.call('INCRBY', KEYS[1], cost)
redis.call('EXPIRE', KEYS[1], window_seconds * 2)
local remaining = math.max(0, limit - (estimated + cost))
return { 1, remaining, 0 }
One Redis round trip per check. All reads and the conditional write happen inside one Lua execution, so two concurrent requests cannot both read count = 99 and both increment to 100.
Take this with you. Shared state is required. The question is how to share it without making every API call slow. Shard Redis by client key. Use a Lua script for atomic check-and-increment.
Decision 2: which algorithm?
There are five common rate-limiting algorithms. Each handles burst traffic differently.
flowchart TD
Start([Which algorithm?]) --> Q1{"Need exact accuracy?"}
Q1 -- yes, small N --> SWLog["Sliding Window Log"]:::ok
Q1 -- no, approx ok --> Q2{"Want explicit burst ceiling?"}
Q2 -- yes --> TB["Token Bucket"]:::ok
Q2 -- no --> Q3{"Care about boundary doubling?"}
Q3 -- no, simple is fine --> FW["Fixed Window"]
Q3 -- yes --> SWC["Sliding Window Counter"]:::ok
classDef ok fill:#dcfce7,stroke:#15803d,color:#14532d
Show: algorithm comparison
| Algorithm | Memory per client | Burst behavior | Boundary problem | When to pick it |
|---|---|---|---|---|
| Token bucket | ~16 bytes | Explicit: bucket_size burst, then throttled to R/sec | None | When you want a burst ceiling with a steady rate after |
| Leaky bucket | ~16 bytes + queue | Smooths spikes by queuing them | None | Network shaping, not APIs: queuing adds latency |
| Fixed window | ~16 bytes | Burst at start of window | Doubles at boundary: 100 at 11:59 + 100 at 12:00 = 200 in 2s | Advisory limits where doubling is acceptable |
| Sliding window log | 8 bytes per request | Ideal | None | Low-traffic, small N, high-value endpoints |
| Sliding window counter | ~24 bytes | Close to ideal, no jump | ~1% inaccuracy vs true sliding | Almost everything else |
The sliding window counter is the right default. O(1) memory, no boundary doubling, fits in a 15-line Lua script, 1% inaccuracy is invisible.
How it works: keep two fixed-window counters (current and previous). Weight the previous by how much time has passed.
1
estimated = current + previous * (1 - elapsed / window)
At 30 seconds into the current minute, with 80 requests in the previous minute and 40 in the current:
1
estimated = 40 + 80 * (1 - 30/60) = 40 + 40 = 80
No boundary jump. As the window turns over, the previous minute’s contribution decays from 100% to 0% smoothly.
Token bucket is a strong second choice. AWS, Stripe, and most public clouds use it. Pick it when you want an explicit burst: “200 requests in the first second, then 1 per second after.”
Take this with you. Sliding window counter for most cases. Token bucket when you need an explicit burst ceiling. Fixed window only if boundary doubling is acceptable.
Decision 3: how do we stay inside the 2ms latency budget?
At 300K QPS, even a 1ms Redis call is a problem. The clients most over their limit are also the ones sending the most traffic, which means the most Redis calls come from exactly the clients you want to stop fast.
The fix: add a tiny in-process LRU cache on each gateway. If a key was over-limit 100ms ago, fast-fail without calling Redis at all.
flowchart TB
C([Client]):::user --> GW["Gateway"]:::app
GW --> LRU["Local LRU<br/>(10K entries, 100ms TTL)"]:::cache
LRU -.cache miss.-> R[("Redis Cluster<br/>(sharded)")]:::cache
LRU -.over limit cached.-> Rej["429 immediately<br/>(no Redis call)"]:::bad
classDef user fill:#dbeafe,stroke:#1e40af,color:#1e3a8a
classDef app fill:#dcfce7,stroke:#15803d,color:#14532d
classDef cache fill:#fecaca,stroke:#b91c1c,color:#7f1d1d
classDef bad fill:#fecaca,stroke:#b91c1c,color:#7f1d1d
For a client hammering at 100x their limit, only the first few requests per window hit Redis. Everything else fast-fails on the local cache. Redis traffic drops by 80-90% for exactly the clients who would otherwise flood it.
Latency per check:
| Path | P99 |
|---|---|
| Local LRU hit (over-limit fast-fail) | ~0.3ms |
| Local LRU miss, Redis hit | ~2ms |
| Local LRU miss, Redis miss (cold key) | ~3ms |
Take this with you. The local LRU is the highest-impact optimization for the lowest cost. The abusers you most want to stop are the ones least likely to touch Redis after the first rejection.
Decision 4: what happens when Redis goes down?
Most candidates skip this. It is where senior answers separate from mid-level ones.
Three failure modes:
One Redis shard fails. Its replica gets promoted. There is a 10-30 second window where some counters serve stale data. Acceptable. Redis Sentinel or Cluster handles this automatically.
One gateway loses its Redis connection. Others still have it. That node should fail its health check and stop receiving traffic. Serving traffic from a node that cannot enforce limits is worse than pulling it out of rotation.
The whole Redis cluster is down. This is the interesting one.
flowchart LR
Down["Redis cluster<br/>unreachable"]:::bad --> Q{"Route class?"}
Q -->|"reads: /search, /list"| FO["Fail-open<br/>let requests through"]:::ok
Q -->|"writes: /pay, /transfer"| FC["Fail-closed<br/>return 503"]:::bad
Q -->|"auth: /login"| FC2["Fail-closed<br/>brute-force protection<br/>is the whole point"]:::bad
Q -->|"callbacks: /webhook"| FW["Fail-open<br/>with stricter local cap"]:::ok
classDef bad fill:#fecaca,stroke:#b91c1c,color:#7f1d1d
classDef ok fill:#dcfce7,stroke:#15803d,color:#14532d
The fail mode is set per route, not globally. Engineering and security sign off on any fail-closed route. Whatever mode you pick, make the failure loud: emit limiter.degraded_mode metrics from every gateway, page immediately.
Take this with you. “What happens when Redis dies?” is the question. “We just use Redis” is not an answer. The fail mode is a business decision per route, not a technical one.
The full architecture
Putting the four decisions together:
flowchart TB
subgraph Edge["Client edge"]
C([Web / SDK / batch job]):::user
LB["Load Balancer / WAF<br/>(TLS, IP-level blocking)"]:::edge
end
subgraph GWLayer["Gateway layer (200 nodes)"]
GW["Gateway<br/>(limiter middleware + local LRU)"]:::app
end
subgraph State["Counter state"]
R[("Redis Cluster<br/>6 shards · 1 replica each<br/>sharded by client key")]:::cache
CFG[("Config store<br/>Postgres · tiers · overrides")]:::db
end
subgraph Async["Async path"]
K{{"Kafka"}}:::queue
CH[("ClickHouse<br/>decision log · forensics")]:::db
end
API["API Service"]:::app
C --> LB
LB --> GW
GW --> R
GW -.config refresh 60s.-> CFG
GW -->|allowed| API
GW -.reject event.-> K
K --> CH
classDef user fill:#dbeafe,stroke:#1e40af,color:#1e3a8a
classDef edge fill:#e2e8f0,stroke:#475569,color:#1e293b
classDef app fill:#dcfce7,stroke:#15803d,color:#14532d
classDef db fill:#fed7aa,stroke:#c2410c,color:#7c2d12
classDef cache fill:#fecaca,stroke:#b91c1c,color:#7f1d1d
classDef queue fill:#ddd6fe,stroke:#6d28d9,color:#4c1d95
Each component in one line:
| Component | Purpose |
|---|---|
| Load Balancer / WAF | TLS termination, IP-level blocking, distributes to gateway nodes |
| Gateway + limiter middleware | Runs the check. Library, not a separate service. |
| Local LRU | Fast-fails known-over-limit keys. Cuts Redis load 80%+ for abusers. |
| Redis Cluster | Shared counters. Sharded by client key. Lua script for atomic check-and-increment. |
| Config store | Tiers, per-client overrides, route costs. Refreshed every 60 seconds per gateway. |
| Kafka + ClickHouse | Every rejection event. Answers “why was customer X blocked at 3pm yesterday?” |
Notice what is not a separate service: the limiter itself. It is middleware inside the gateway process. A separate limiter service adds one network hop per API call, which blows the 2ms latency budget.
Walk: an allowed request, end to end
Alice has a pro account. Limit is 1,000 req/min. She is at count 847. She sends a search request (cost = 5 units).
sequenceDiagram
autonumber
participant Alice
participant LB as Load Balancer
participant GW as Gateway
participant LRU as Local LRU
participant R as Redis
participant API as API Service
Alice->>LB: GET /api/search (key: sk_pro_alice)
LB->>GW: forward
GW->>LRU: check sk_pro_alice:search
LRU-->>GW: not cached
rect rgb(241, 245, 249)
Note over GW,R: one atomic Lua call
GW->>R: EVALSHA(script, cur_key, prev_key, limit=1000, now_ms, cost=5)
R-->>GW: allowed=1, remaining=148, retry_after_ms=0
end
GW->>LRU: update cache (count=852, exp=100ms from now)
GW->>API: forward request
API-->>GW: 200 OK
GW-->>Alice: 200 + X-RateLimit-Remaining: 148
Now Alice sends one more request after hitting the limit exactly:
sequenceDiagram
autonumber
participant Alice
participant GW as Gateway
participant LRU as Local LRU
Alice->>GW: GET /api/search
GW->>LRU: check sk_pro_alice:search
LRU-->>GW: count=1000, over limit (cached 40ms ago)
GW-->>Alice: 429 Too Many Requests, Retry-After: 12
Note over GW: no Redis call
The second flow never touches Redis. The LRU absorbed it. This is what cuts Redis load during an abuse spike.
The hot tenant problem
Bob is an enterprise customer at 100,000 req/min. His counter key gets hit 1,667 times per second. The one Redis shard that owns his key is at 80% CPU. Everyone else whose keys hash to the same shard is also slowed down.
flowchart TD
Start(["Bob: 1,667 counter hits/sec"]) --> LRU{"Local LRU<br/>hit?"}
LRU -->|"~70% of checks<br/>(100ms TTL)"| OK1["Fast-fail from memory<br/>~500 Redis ops/sec reach shard"]:::ok
LRU -->|"LRU miss"| Shard{"Redis shard<br/>under load?"}
Shard -->|"normal"| Script["EVALSHA<br/>~2ms"]:::ok
Shard -->|"overloaded"| Replica["Route read to<br/>replica<br/>~3ms"]:::ok
Replica -->|"still hot"| Shadow["Shadow counter on<br/>second shard<br/>split load 2x"]:::ok
classDef ok fill:#dcfce7,stroke:#15803d,color:#14532d
Three mitigations, in order of cost:
| Mitigation | What it solves | Cost |
|---|---|---|
| Local LRU (already built) | Cuts 70%+ of checks from hitting the shard | Free |
| Redis read replicas | Distributes read pressure across N replicas | Low |
| Shadow counter on a second shard | Halves writes to the hot shard | Medium: adds a small counting error |
For predictable bursts (an enterprise batch job starts at 2am), pre-announce: the customer registers the burst window and the system raises their local LRU TTL to 1 second for that window, reducing shard traffic further.
Take this with you. One large tenant can saturate a Redis shard and drag everyone else on it. The local LRU is the first defense. Replicas for reads. Shadow counters only if the shard is still overloaded.
Follow-up questions
Try answering each in 2 or 3 sentences before opening the solution.
Multiple keys per request. A request has an IP, an API key, and a user_id. Do you check all three limits in parallel or one after another? What if the API key limit allows it but the IP limit does not?
Limits per route.
/api/searchis expensive./api/healthis cheap. Same client, same tier. How do you express this in the data model?Cost-based limits. Instead of “1,000 requests per minute,” the limit is “1,000 units per minute.” A search costs 5 units. A lookup costs 1. What changes in the limiter?
Burst allowance. A client should be able to send 200 in the first 10 seconds, but still hit a total of 1,000 per minute. Which algorithm? How do you configure it?
IP rotation (botnet). An attacker rotates through 10,000 different home IPs. Per-IP limits do nothing. What do you do?
Customer override. An enterprise customer negotiated a 100x limit. Where do you store the override? How fresh must it be? What if the override service is down?
Distributed accuracy. Two requests from the same client land on two gateway nodes within 1ms. Both check Redis. Both see count = 99 (under a 100 limit). Both INCR. Now count = 101 and both were allowed. What stops this?
Pre-warming a known burst. A customer says they will run a batch job at 2 AM that needs 10,000 requests in 60 seconds, well above their normal limit. How do you let them through without raising their permanent limit?
“User got popular.” A blog post hits the front page. A user’s API endpoint sees 10x traffic. The limiter blocks them. Is that correct? How would you build a smarter signal?
The limiter is the bottleneck. Your dashboard shows the limiter middleware adds 8ms to P99 latency. Your budget was 2ms. Where do you look first?
Related problems
- URL Shortener (001). The
POST /linksendpoint sits behind exactly this kind of limiter. The design plugs in directly. - Distributed Cache (009). The Redis cluster backing the limiter is a distributed cache. The eviction, sharding, and replication choices there set the limiter’s failure modes.
- News Feed (002). Both the
POST /postsendpoint and the timeline read path need rate limiting. The cost-based pattern in follow-up 3 is exactly how feeds throttle expensive ranking queries.
Try the problem on your own first. Solutions are most valuable after you've struggled with it.
Solution: Rate Limiter
What this system is
A rate limiter is a counter lookup with a yes-or-no decision. It sits inside the API gateway as middleware, not as a separate service. It enforces per-client quotas by keeping a shared counter in Redis, checking it atomically on every request, and returning 429 when the client is over their limit.
The algorithm is the easy part. What actually matters:
- Where does the check run? In-process middleware on each gateway, not a separate service. A separate service adds one network hop per API call and blows the latency budget.
- How do 200 nodes share state? Redis Cluster, one atomic Lua script per check, sharded by client key.
- What happens when Redis dies? Configurable per route. Public reads fail-open. Payment and auth endpoints fail-closed.
- How do you keep the check under 2ms? A tiny per-node LRU cache that fast-fails over-limit clients without touching Redis.
The best algorithm for almost every case is the sliding window counter: O(1) memory, no boundary doubling, 15 lines of Lua. Token bucket is a strong second when you need explicit burst ceilings.
1. The two questions that matter most
If you only get to ask two clarifying questions, ask these.
What do you limit on? Per IP? Per API key? Per user? All three? Per-IP fails immediately against corporate NAT and IP rotation. Per-API-key is the right default. Layered keys (IP + API key + user) catch more abuse vectors.
What happens when the counter store is down? This is a business question. Public read APIs usually fail-open (enforce nothing, keep serving). Payment and auth endpoints fail-closed (stop serving rather than allow abuse). If you do not have an answer to this, you have not finished the design.
2. The math
| Number | Value | Why it matters |
|---|---|---|
| Peak QPS | 300,000 | Drives Redis sizing |
| Counter writes/sec | 300K to 1M (multiple counters per client) | One Redis handles ~100K/sec, so sharding is required |
| Total state | ~30 MB | Tiny. Shard for availability, not capacity |
| Per-node QPS | 1,500 | Easy in-process. The cost is round trips, not compute |
| Wire bytes per check | ~120 | Negligible. Round trips are what hurt |
State is small. Requests are many. Every request must touch the limiter. Everything else is a consequence of those three facts.
3. The API and headers
The limiter is internal middleware. It does not have a public API. It produces headers on every API response.
Allowed:
1
2
3
4
HTTP/1.1 200 OK
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 847
X-RateLimit-Reset: 1716381660
Rejected:
1
2
3
4
5
6
7
8
9
10
11
12
13
HTTP/1.1 429 Too Many Requests
Retry-After: 28
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1716381660
Content-Type: application/json
{
"error": "rate_limited",
"message": "Request exceeds rate limit of 1000 requests per minute.",
"retry_after_seconds": 28,
"doc_url": "https://api.example.com/docs/rate-limits"
}
Four choices worth defending:
| Choice | Reason |
|---|---|
| 429, not 503 | 503 means “the server is overloaded, try anywhere.” 429 means “you are over your limit, others are fine.” |
Retry-After is required | Well-written SDKs respect it. Sending it costs nothing. |
| Headers on success too | Good clients self-throttle. They never reach 429. |
| Include a doc URL | The single most effective way to cut support tickets. |
4. The data model
Two storage layers. Live counters in Redis. Config in Postgres.
Live counters (Redis keys):
1
rl:{scope}:{identifier}:{route_class}:{window_start_ts}
Examples:
1
2
3
rl:apikey:sk_live_xyz789:search:1716381600 -> integer count, TTL 120s
rl:ip:198.51.100.42:default:1716381600 -> integer count, TTL 120s
rl:user:user_42:write:1716381600 -> integer count, TTL 120s
The sliding window counter reads two keys per check: current window and previous window. TTL is 2x the window length so the previous key is still readable. A key without TTL is a memory leak.
Config store (Postgres):
Show: the config schema
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
CREATE TABLE rate_limit_tiers (
tier_name VARCHAR(32) PRIMARY KEY,
default_limit INTEGER NOT NULL,
window_seconds INTEGER NOT NULL,
burst_multiplier NUMERIC(3,2) NOT NULL DEFAULT 1.00
);
CREATE TABLE rate_limit_overrides (
api_key VARCHAR(64) NOT NULL,
route_class VARCHAR(32) NOT NULL,
limit_value INTEGER NOT NULL,
window_seconds INTEGER NOT NULL,
valid_until TIMESTAMPTZ,
reason TEXT,
PRIMARY KEY (api_key, route_class)
);
CREATE TABLE rate_limit_route_costs (
route_pattern VARCHAR(128) NOT NULL,
route_class VARCHAR(32) NOT NULL,
cost_units INTEGER NOT NULL DEFAULT 1,
PRIMARY KEY (route_pattern)
);
Three things doing real work:
- Tiers are a tiny static table. Each gateway loads it at startup and refreshes every 60 seconds. A tier change takes up to one minute to spread. Acceptable: limits are not security boundaries.
- Overrides are keyed by (api_key, route_class). An enterprise customer with raised search limits but default write limits is two rows, not one JSON blob.
- Route costs are in a separate table. Pricing team owns costs. Sales team owns overrides. Separate tables prevent merge conflicts and confused ownership.
5. The algorithm
The sliding window counter is the right default. Keep two fixed-window counters (current window, previous window) and compute a weighted blend.
If we are 30 seconds into the current minute, with 80 requests in the previous minute and 40 in the current:
1
2
3
estimated = current + previous * (1 - elapsed / window)
= 40 + 80 * (1 - 30/60)
= 80
No boundary jump. As time passes, the previous window’s contribution decays from 100% to 0% smoothly.
stateDiagram-v2
direction LR
[*] --> HasTokens
HasTokens --> HasTokens: refill at R/sec, capped at bucket_size
HasTokens --> Empty: last token taken
Empty --> HasTokens: refill drips in
HasTokens --> Allowed: request arrives, take 1 token
Empty --> Rejected: request arrives, no tokens, 429
Allowed --> [*]
Rejected --> [*]
The diagram above is the token bucket state machine, shown as the second option. Use it when you need an explicit burst ceiling: “200 requests in the first second, then 1/second after.” Two knobs: bucket_size (max burst) and refill_rate (sustained throughput).
Key properties of the Lua implementation (full script in question.md):
- One Redis round trip per check (two GETs + one conditional INCRBY, all inside one Lua execution).
- Atomic at the shard level. Two concurrent requests cannot both pass the check at count = limit - 1.
- Clock from the caller, not Redis. Prevents window mismatches during failover.
costparameter supports cost-based limits at zero extra Redis calls.
6. How the gateway resolves a check
The limiter is in-process middleware. Per-request flow:
- Auth middleware extracts
(api_key, user_id, ip)from the request. - Limiter computes rate limit keys: typically 2-4 per request (per-API-key, per-IP, per-user, per-API-key-per-route).
- For each key, check the local LRU. If any key is over-limit and the cache entry is fresh, reject immediately. No Redis call.
- If no fast-path hit, run all checks against Redis in one pipelined batch: one EVALSHA per key, one round trip total.
- Take the most restrictive result. If any key is rejected, the request is rejected.
- Update the local LRU with the new counts.
- If allowed: forward to the API service, attach
X-RateLimit-*headers to the response. - If rejected: return 429, emit a rejection event to Kafka.
Show: gateway-side limiter code
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
SCRIPT_HASH = redis.script_load(LUA_SCRIPT)
class Limiter:
def __init__(self, redis_client, local_cache_size=10000):
self.redis = redis_client
self.lru = LRUCache(local_cache_size)
def check(self, api_key, route_class, cost=1):
tier = config.get_tier_for(api_key)
limit, window_s = tier.limit, tier.window_seconds
now_ms = time.time_ns() // 1_000_000
window_start = (now_ms // 1000 // window_s) * window_s
prev_start = window_start - window_s
cur_key = f"rl:apikey:{api_key}:{route_class}:{window_start}"
prev_key = f"rl:apikey:{api_key}:{route_class}:{prev_start}"
cached = self.lru.get(cur_key)
if cached and cached.count_estimate >= limit and cached.exp_ms > now_ms:
return RateLimitResult(allowed=False, remaining=0,
retry_after_ms=cached.exp_ms - now_ms)
try:
allowed, remaining, retry_ms = self.redis.evalsha(
SCRIPT_HASH, 2, cur_key, prev_key,
limit, window_s, now_ms, cost)
except RedisError:
return self._fail_mode(api_key, route_class, cost)
self.lru.set(cur_key, CachedCount(
count_estimate=limit - remaining,
exp_ms=now_ms + 100
))
return RateLimitResult(allowed=bool(allowed),
remaining=remaining,
retry_after_ms=retry_ms)
def _fail_mode(self, api_key, route_class, cost):
if config.fail_open_for(route_class):
metrics.increment("limiter.degraded.fail_open")
return RateLimitResult(allowed=True, remaining=-1, retry_after_ms=0)
else:
metrics.increment("limiter.degraded.fail_closed")
return RateLimitResult(allowed=False, remaining=0, retry_after_ms=5000)
The LRU has a 100ms TTL per entry. Short enough that a real recovery is honored. Long enough to absorb the next 5-10 requests from an abusive client without hitting Redis.
7. The architecture
flowchart TB
subgraph Edge["Client edge"]
C([Web / SDK / batch job]):::user
LB["Load Balancer / WAF<br/>(TLS, IP-level blocking)"]:::edge
end
subgraph GWLayer["Gateway layer (200 nodes)"]
GW["Gateway<br/>(limiter middleware + local LRU)"]:::app
end
subgraph CounterState["Counter state"]
R[("Redis Cluster<br/>6 shards · 1 replica each<br/>sharded by client key")]:::cache
CFG[("Config store<br/>Postgres · tiers · overrides")]:::db
end
subgraph Async["Async path"]
K{{"Kafka"}}:::queue
CH[("ClickHouse<br/>decision log · forensics")]:::db
end
API["API Service"]:::app
C --> LB
LB --> GW
GW --> R
GW -.config refresh 60s.-> CFG
GW -->|allowed| API
GW -.reject event.-> K
K --> CH
classDef user fill:#dbeafe,stroke:#1e40af,color:#1e3a8a
classDef edge fill:#e2e8f0,stroke:#475569,color:#1e293b
classDef app fill:#dcfce7,stroke:#15803d,color:#14532d
classDef db fill:#fed7aa,stroke:#c2410c,color:#7c2d12
classDef cache fill:#fecaca,stroke:#b91c1c,color:#7f1d1d
classDef queue fill:#ddd6fe,stroke:#6d28d9,color:#4c1d95
Five things to notice:
- The limiter is in-process middleware. A separate “rate limiter service” adds one network hop per API call. That doubles the latency budget.
- The local LRU on each gateway cuts Redis traffic by 80%+ for abusive clients. The clients you most need to stop are the ones least likely to hit Redis after the first few requests.
- Redis is sharded by client key. All counters for one client land on the same shard. The Lua script reads current-window and previous-window keys atomically, with no cross-shard coordination needed.
- The config service is read-mostly. Each gateway refreshes every 60 seconds. If the config service dies, gateways use their last-known config.
- The decision log captures every rejection. When customer support asks “why was customer X blocked at 3pm yesterday?”, you have an answer. Redis counters have already expired by then.
8. A request, end to end
sequenceDiagram
autonumber
participant Alice
participant LB as Load Balancer
participant GW as Gateway
participant LRU as Local LRU
participant R as Redis
participant API as API Service
Alice->>LB: GET /api/search (key: sk_pro_alice)
LB->>GW: forward
GW->>LRU: check sk_pro_alice:search
LRU-->>GW: not cached
rect rgb(241, 245, 249)
Note over GW,R: one pipelined round trip
GW->>R: EVALSHA(script, cur_key, prev_key, limit=1000, now_ms, cost=5)
R-->>GW: allowed=1, remaining=148, retry_after_ms=0
end
GW->>LRU: update cache (count=852, exp=100ms from now)
GW->>API: forward request
API-->>GW: 200 OK
GW-->>Alice: 200 + X-RateLimit-Remaining: 148
Target latencies:
| Operation | P99 |
|---|---|
| Allowed request (LRU hit, no Redis) | ~0.5ms |
| Allowed request (LRU miss, Redis call) | ~2ms |
| Rejected (LRU hit) | ~0.3ms |
9. The scaling journey: 10 users to 1 million
flowchart LR
S1["Stage 1<br/>10-100 users<br/>in-memory counters<br/>~$80/mo"]:::s1
S2["Stage 2<br/>~1,000 users<br/>+ Postgres config table<br/>+ admin UI<br/>~$250/mo"]:::s2
S3["Stage 3<br/>10k-100k users<br/>+ Redis Cluster<br/>+ local LRU<br/>+ decision log<br/>~$2-5k/mo"]:::s3
S4["Stage 4<br/>1M+ users<br/>+ per-region Redis<br/>+ cost-based limits<br/>+ time-limited overrides"]:::s4
S1 --> S2 --> S3 --> S4
classDef s1 fill:#e0f2fe,stroke:#0369a1,color:#0c4a6e
classDef s2 fill:#dcfce7,stroke:#15803d,color:#14532d
classDef s3 fill:#fef3c7,stroke:#a16207,color:#713f12
classDef s4 fill:#fce7f3,stroke:#be185d,color:#831843
Stage 1: 10 to 100 users
One app instance. In-memory counter dictionary. No Redis. Limits are constants in code. About $80/month.
Enough because you see ten requests a minute. Building more is over-engineering.
Stage 2: 1,000 users
What breaks: a customer asks for a higher limit and you have to deploy code to give it to them.
Move limits to a rate_limit_tiers Postgres table. Add a simple admin UI for overrides. Still one app instance, so in-memory counters are fine. ~$250/month.
Stage 3: 10,000 to 100,000 users
Multiple app instances. In-memory counters drift. A client can get Nx their limit, where N is the number of instances.
Add Redis. Atomic Lua script for check + increment. Add the local LRU. Shard Redis by client key with 6 primary shards, one replica each. Add the decision log to Kafka and ClickHouse. Add per-route fail mode. Cost: $2-5K/month.
Stage 4: 1 million users
New problems: EU data residency, expensive-vs-cheap route costs, enterprise batch bursts, botnet attacks rotating IPs.
Add per-region Redis clusters (counters are not synced cross-region: a client using two regions can get up to 2x their limit, which is acceptable and documented). Add cost-based limits via the cost parameter. Add time-limited overrides via valid_until. Add layered keys and an abuse-detection pipeline.
The core architecture has not changed since Stage 3. You added regions, knobs, and a separate abuse pipeline.
10. Reliability
Redis shard failure. Sentinel or Cluster promotes the replica. During the 10-30 second window, some counters serve stale data. The gateway falls into degraded mode. Fail-open or fail-closed per route.
Network partition. One gateway loses Redis while others keep it. That gateway should fail its health check and stop receiving traffic. Serving rate-limited traffic from a node that cannot enforce limits is worse than taking it out of rotation.
Slow Redis. Use a per-call timeout (5ms) with a circuit breaker. After N consecutive timeouts the gateway flips to degraded mode. This prevents Redis latency from bleeding into API latency.
Full cluster outage.
| Route class | Fail mode | Why |
|---|---|---|
/api/search, /api/list | fail-open | Uptime matters more than enforcement |
/api/transfer, /api/pay | fail-closed | Abuse risk is bigger than downtime |
/api/login | fail-closed | Brute-force protection is the whole point |
/api/webhook | fail-open with stricter local cap | Webhooks must keep flowing |
Counter drift after failover. If a Redis primary dies before AOF sync, some counters reset to zero. Affected clients get a small extra allowance. Acceptable.
Lua script bug. A bad script can lock a Redis shard. Scripts are versioned. The gateway sends a SCRIPT_HASH it can roll back. Canary new scripts to one gateway, watch metrics for 10 minutes, then roll out.
11. Observability
| Metric | Why it matters |
|---|---|
limiter.check.latency.p99 per gateway | Over 5ms means slow Redis, bad Lua, or connection pool exhaustion |
limiter.rejections_per_sec by tier | Spike means abuse or a customer hitting their cap |
limiter.local_cache_hit_rate | Should be >50% for hot keys. Drop suggests key churn or a rotating-key attack |
limiter.redis_op_rate | Tracks Redis load, drives sharding decisions |
limiter.degraded_mode.fail_open_count | Critical: Redis unreachable from at least one node |
limiter.degraded_mode.fail_closed_count | Critical: same, on a fail-closed route |
limiter.script_eval_errors | Lua failures (script update bugs) |
config.refresh.lag_seconds | Stale config means wrong limits |
Page on: limiter.check.latency.p99 > 5ms for 5 minutes. Any limiter.degraded_mode.* sample. Any script_eval_errors.
Ticket on: rejection rate spike on any tier. Cache hit rate sustained drop.
12. Follow-up answers
1. Multiple keys per request.
Check all three (per-IP, per-API-key, per-user) in parallel via one Redis pipeline: one round trip, three EVALSHAs. Take the most restrictive result. Each key catches a different abuse vector. Checking sequentially adds 3x latency. Checking only one leaves a gap.
2. Limits per route.
Map URL patterns to a route_class:
1
2
3
/api/v1/search -> "search"
/api/v1/transfer -> "write"
/health -> "free" (cost = 0, always passes)
The rate limit key includes route_class. Each client has multiple parallel counters. Search consumption does not eat into write consumption.
3. Cost-based limits.
The Lua script takes a cost parameter. The gateway looks up the cost for the matched route. The script does INCRBY cost instead of INCR 1. The check becomes estimated + cost > limit. Cost weights should reflect real backend cost. A search hitting Elasticsearch is 50x more expensive than a Postgres lookup, so search costs 50.
4. Burst allowance.
Switch the route to token bucket. Config:
1
2
3
4
5
6
tier: pro
limits:
default:
algorithm: token_bucket
bucket_size: 200
refill_rate: 16.67
The bucket holds 200 tokens and refills at 16.67/sec (1,000/min). The first 200 requests go through immediately. Then throttled to 16.67/sec. The switch is one branch in the limiter. Both algorithms have Lua scripts.
5. IP rotation (botnet).
Per-IP limits fail against botnets. Defense is layered:
- API-key-level limits catch a botnet rotating IPs with one stolen key.
- Behavioral signals across IPs: user-agent stability, request path patterns, inter-request timing. 10,000 IPs all hitting
/api/loginwith the same user-agent and exactly 1.0s between requests is one attacker. Feed this into a fraud-detection pipeline that writes to ablocked_keysRedis set. - Subscribe to a commercial IP reputation feed. Apply stricter limits to known-botnet ranges.
The limiter is one defense, not the only one. Treat it as a speed bump.
6. Customer override.
Stored in rate_limit_overrides (Postgres). Each gateway pulls a snapshot every 60 seconds. Staleness of 60 seconds is fine: limits are not transactional. If the config service is down, gateways use last-known config indefinitely.
For emergency changes (“block this customer NOW”), a separate blocked_keys Redis set is consulted on every request. Writing to it propagates in milliseconds. Not the everyday path.
7. Distributed accuracy.
The Lua script is atomic at the Redis shard level. Only one of two concurrent requests can read count = 99 and increment to 100. The other reads count = 100 and is rejected. Lua atomicity is the whole defense.
The only race the design accepts is the local LRU’s 100ms inconsistency window. Theoretical max overshoot with 200 nodes: 400 extra requests per window. Practical overshoot in load tests: under 10. Documented. No one has complained.
8. Pre-warming a known burst.
Write a row to rate_limit_overrides with valid_until = 2026-05-28 02:01:00. Active only for that minute. The gateway picks it up within 60 seconds. For recurring batch use cases, a separate /batch endpoint with its own higher limits (backed by a separate API key) is cleaner.
9. “User got popular.”
The limit enforced correctly if set to protect the API. Whether it was the right call depends on whether the goal is protecting the API or protecting the customer from runaway cost.
Smarter signals: compare the current minute’s rate against the customer’s 7-day P95. A 10x step-change sustained for 2 minutes triggers an automated email: “Your API saw a 10x spike. You have been temporarily throttled. Click here to raise your limit.” Self-service relief. If the traffic pattern is bot-like (single user-agent, one geo, repeated path), throttle. If it is human (mixed user-agents, geo-distributed, varied paths), that is real viral traffic and it is worth auto-raising the limit for a paid account.
10. The limiter is the bottleneck.
Limiter middleware is adding 8ms P99. Budget was 2ms.
- Redis latency. Run
redis-cli --latency. Check shard CPU, network path, AOF sync settings. - Pipelining. Are multiple-key checks sequential instead of pipelined? 3 sequential calls x 2ms each = 6ms. Should be one pipelined call.
- LRU miss rate. If the cache misses on every request, every check hits Redis. Cache too small, or every request has a unique key (sign of a rotating-key attack).
- Lua complexity. Has someone added expensive logic? Profile with
SLOWLOG. - Gateway CPU. If the pod is pegged, even fast Redis responses queue.
- Connection pool exhaustion. If the Redis connection pool is small, requests wait. Raise the pool or add gateway nodes.
The first three explain ~90% of latency regressions in practice.
13. Trade-offs worth saying out loud
Per-region vs global limits. Per-region is the default. Cross-region counter sync adds 100ms+ to every check. Global limits exist only for enterprise contracts that require them. Document the trade-off explicitly. A client using two regions can get up to 2x their limit. That is not a bug; it is the price of low latency.
Why the limiter is a library, not a service. A separate rate-limiter service adds one network hop per API call. At a 2ms budget, one hop already takes 1-2ms. There is no room for another.
Why Lua, not application-level transactions. A read followed by a conditional write in two separate Redis calls is not atomic. Two concurrent requests can both read count = 99, both pass the check, and both increment. Lua runs serially within a Redis shard. No coordination protocol needed.
Why not the cloud provider’s managed limiter. AWS WAF, Cloudflare, and similar tools are good for IP-level and bot mitigation. They are not flexible enough for tiered customer pricing or cost-based limits. Most production APIs use both: cloud limiter as the outer ring for IP threats, custom limiter at the gateway for business-rule enforcement.
What to revisit at 10x scale. Move counter state from Redis to a purpose-built counter store (ScyllaDB or a custom in-memory grid) once Redis cost becomes meaningful. Push limit checks to the edge (PoPs) for global APIs, accepting gossip-driven inaccuracy. Move tier resolution into a sidecar proxy (Envoy) so gateway code does not need to know about tiers.
14. Common mistakes
Most weak answers fall into one of these:
“Just use a token bucket in Redis” with nothing else. Skipped clarification. Did not address shared state across nodes. Did not address failure mode. Stops the interview.
No local LRU. At 300K QPS, every check is a round trip. The local LRU is the cheapest, highest-impact optimization. A senior candidate names it without prompting.
Per-IP only. Fails against NAT (corporate offices, mobile carriers) and against IP rotation. Layered keys is the correct answer.
No fail-open vs fail-closed distinction. The interviewer will ask. Have a route-level opinion, not a single global choice.
Forgetting Retry-After. Polite clients respect it. Sending it signals you have built real public APIs.
Limiter as a separate service. Now every API call pays two network hops instead of one. The latency budget does not survive.
Confusing fixed and sliding windows. Fixed resets on the clock boundary: 100 at 11:59:59 plus 100 at 12:00:00 = 200 in two seconds. Sliding counts the last N seconds rolling. The boundary doubling problem is why sliding exists.
One Redis box, no replication, no failure plan. Half the design score is in operational reality.
If you name 8 of these without prompting, you are interviewing at staff level. The three that separate strong from average: the local LRU, layered keys, and per-route fail mode.