Design a Shopping Cart Service
What we are building
Alice adds a $79 pair of shoes to her cart on her laptop. An hour later she opens her phone, and the shoes are still there. She clicks Buy, the shoes ship, and her credit card gets charged exactly once.
That is the whole product. It sounds like three database rows. The interesting problems are hiding underneath.
Five hard problems show up in every real cart:
- Cross-device sync. The cart must live on the server, not in a browser cookie. A cookie only works on one device.
- Guest-to-user merge. Alice browses as a guest, adds 3 items, then logs in. She already had 2 items saved. Now what?
- Inventory race. The cart says a shoe is in stock. Twenty minutes later, someone else took the last pair. Alice clicks Buy.
- Idempotent checkout. Alice’s phone drops the network mid-checkout. The app retries. She must not get two orders or two charges.
- Abandoned cart cleanup. Roughly 60-70% of carts are never bought. Ghost carts pile up, and without cleanup guest cart tokens accumulate in the database forever.
We will start with the smallest version that works, then add one piece at a time as each problem appears.
The lifecycle of one cart
Every cart moves through a small set of states. Picture it once before drawing any architecture.
stateDiagram-v2
direction LR
[*] --> Active: first item added
Active --> Active: add / update / remove item
Active --> Merging: Alice logs in (guest to account)
Merging --> Active: merge complete
Active --> CheckingOut: Alice clicks Checkout
CheckingOut --> Converted: payment succeeds
CheckingOut --> Active: payment fails, cart restored
Active --> Abandoned: no activity for 6 hours
Converted --> [*]
Abandoned --> [*]
Everything else in this design (Redis, Kafka, inventory holds, price drift handling) is a complication on top of this one state machine.
Take this with you. A cart is a small state machine per user. The hard part is not the state machine. It is what happens between Active and Converted.
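One way to make the diagram concrete is a transition table. The sketch below is illustrative only: the event names (edit_item, login, idle_timeout, and so on) are hypothetical labels for the edges in the diagram, not names from a real API.

```python
from enum import Enum


class CartState(Enum):
    ACTIVE = "active"
    MERGING = "merging"
    CHECKING_OUT = "checking_out"
    CONVERTED = "converted"
    ABANDONED = "abandoned"


# Allowed transitions, copied edge by edge from the state diagram.
# Event names are hypothetical labels chosen for this sketch.
TRANSITIONS = {
    (CartState.ACTIVE, "edit_item"): CartState.ACTIVE,
    (CartState.ACTIVE, "login"): CartState.MERGING,
    (CartState.MERGING, "merge_complete"): CartState.ACTIVE,
    (CartState.ACTIVE, "checkout"): CartState.CHECKING_OUT,
    (CartState.CHECKING_OUT, "payment_succeeded"): CartState.CONVERTED,
    (CartState.CHECKING_OUT, "payment_failed"): CartState.ACTIVE,
    (CartState.ACTIVE, "idle_timeout"): CartState.ABANDONED,
}


def next_state(state: CartState, event: str) -> CartState:
    """Return the next state, or raise if the diagram has no such edge."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {state.name} on {event!r}") from None
```

Rejecting unknown edges explicitly is the point of the table: a converted or abandoned cart cannot be edited by accident.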
How big this gets
Same product, two very different sizes.
| Input | Small shop (500 DAU) | Big shop (1M DAU) |
|---|---|---|
| Carts per day | 150 | 300,000 |
| Cart writes per second (peak) | ~0.01 | ~21 |
| Cart icon reads per second (peak) | ~0.06 | ~350 |
| Active carts at any moment | ~50 | ~25,000 |
| Live storage | ~33 MB/year | ~7 GB |
Show: how the numbers come out
Assume 30% of visitors add at least one item. Average cart has 3 items, edited twice. Each visitor views about 10 pages. Peak is taken as 3× the steady rate.
Small shop (500 visitors/day):
- Carts: 500 × 30% = 150 per day
- Cart writes: 150 × 2 edits = 300/day, so 0.003/sec steady
- Cart icon reads: 500 visitors × 10 page views = 5,000/day, so 0.06/sec
- Active carts: 150 carts × ~8h average life / 24h = ~50 open at any moment
- Storage: 150 carts × 3 items × 200 bytes = ~90 KB/day, ~33 MB/year
Big shop (1M visitors/day):
- Carts: 1M × 30% = 300,000/day, so 3.5/sec steady
- Cart writes: 600,000/day, so 7/sec steady, 21/sec peak
- Cart icon reads: 1M × 10 page views = 10M/day, so 115/sec steady, 350/sec peak
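- Active carts: 300,000 carts × ~2h average active window / 24h = ~25,000 open at any moment. The 30-day TTL determines how long rows are stored, not how many carts are open at once.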
- Storage: 300K carts × 3 items × 200 bytes = ~180 MB/day, ~7 GB live
The number that matters: writes are tiny even at 1M users. Any database handles 21 writes/sec. The real challenge is the cart icon read on every page: 350/sec with a tight latency target. That single endpoint sets the caching strategy.
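The arithmetic above is easy to re-run for other traffic levels. A minimal sketch, assuming the inputs used in this section (30% of visitors create a cart, 2 edits per cart, 10 page views per visitor, 3 items of 200 bytes) and a 3× peak-to-steady factor, which is the ratio behind the 7 → 21 and 115 → 350 figures. The function name estimate is made up for illustration.

```python
SECONDS_PER_DAY = 86_400
PEAK_FACTOR = 3  # assumption: peak traffic is 3x the daily average


def estimate(daily_visitors: int) -> dict:
    """Back-of-envelope cart load for a given number of daily visitors."""
    carts_per_day = daily_visitors * 0.30        # 30% add at least one item
    writes_per_day = carts_per_day * 2           # each cart is edited twice
    reads_per_day = daily_visitors * 10          # cart icon on each of 10 page views
    bytes_per_day = carts_per_day * 3 * 200      # 3 items x 200 bytes each
    writes_steady = writes_per_day / SECONDS_PER_DAY
    reads_steady = reads_per_day / SECONDS_PER_DAY
    return {
        "carts_per_day": carts_per_day,
        "writes_per_sec_steady": writes_steady,
        "writes_per_sec_peak": writes_steady * PEAK_FACTOR,
        "reads_per_sec_steady": reads_steady,
        "reads_per_sec_peak": reads_steady * PEAK_FACTOR,
        "storage_mb_per_day": bytes_per_day / 1_000_000,
    }


big = estimate(1_000_000)
small = estimate(500)
print(round(big["writes_per_sec_peak"]), round(big["reads_per_sec_peak"]))  # prints: 21 347
```

The unrounded peak read rate comes out near 347/sec, which the tables round to 350.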
| Metric | At 1M users |
|---|---|
| Writes/sec (peak) | ~21. Any database handles this. |
| Icon reads/sec (peak) | ~350. This is the design constraint. |
| Active carts in Redis | ~25,000. About 5 MB as compact hashes. |
| Real bottleneck | Cart icon read on every page, not the buy button. |
Take this with you. The cart is a read-heavy problem disguised as a write problem. Optimize the icon read, not the add-item write.
The smallest version that works
One Postgres, one app server, logged-in users only.
flowchart LR
A([Alice]):::user --> CS["Cart Service"]:::app
CS --> DB[("Postgres<br/>carts + cart_items")]:::db
CS --> Cat["Catalog<br/>(name, price, stock)"]:::ext
classDef user fill:#dbeafe,stroke:#1e40af,color:#1e3a8a
classDef app fill:#dcfce7,stroke:#15803d,color:#14532d
classDef db fill:#fed7aa,stroke:#c2410c,color:#7c2d12
classDef ext fill:#e9d5ff,stroke:#7e22ce,color:#581c87
Three endpoints carry the full product at this stage.
| Endpoint | What it does |
|---|---|
| GET /api/v1/cart | Return items, quantities, snapshot prices, and live availability |
| POST /api/v1/cart/items | Add an item or increase its quantity |
| PATCH /api/v1/cart/items/{sku} | Update quantity; qty 0 removes the item |
Show: the two tables
CREATE TABLE carts (
cart_id UUID PRIMARY KEY,
user_id BIGINT,
cart_token UUID,
status TEXT NOT NULL DEFAULT 'active',
item_count INT NOT NULL DEFAULT 0,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
expires_at TIMESTAMPTZ
);
CREATE TABLE cart_items (
cart_id UUID NOT NULL REFERENCES carts(cart_id),
sku TEXT NOT NULL,
qty INT NOT NULL CHECK (qty > 0 AND qty <= 99),
snapshot_price_cents INT NOT NULL,
added_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
PRIMARY KEY (cart_id, sku)
);
item_count is denormalized on the carts row. The cart icon on every page only needs that one number: one row read, no JOIN, no catalog call.
snapshot_price_cents records what Alice saw when she added the item. If the price changes tomorrow, the audit trail still shows what she was shown.
This is enough for a hundred users. The interesting question is what breaks first as the system grows.
Decision 1: where does the cart live for guests?
Marketing asks: can users browse and add items without creating an account? Almost every real shop says yes. This single answer changes the data model and adds a merge step at login.
The fix is a cart_token: a random UUID stored in a browser cookie. The cart lives on the server, keyed by that token instead of a user ID. The cookie just points at the row.
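As a rough sketch of issuing the token, the snippet below generates a random UUID and builds the Set-Cookie value with Python's standard http.cookies module. The cookie attributes (HttpOnly, Secure, SameSite=Lax) and the 30-day Max-Age are assumptions chosen for illustration, and the function name issue_cart_token_cookie is hypothetical.

```python
import uuid
from http.cookies import SimpleCookie

THIRTY_DAYS = 30 * 24 * 60 * 60  # assumed cookie lifetime, in seconds


def issue_cart_token_cookie() -> tuple[str, str]:
    """Create a new guest cart token and the Set-Cookie header value for it."""
    token = str(uuid.uuid4())           # random, unguessable pointer to the server-side cart
    cookie = SimpleCookie()
    cookie["cart_token"] = token
    morsel = cookie["cart_token"]
    morsel["path"] = "/"
    morsel["max-age"] = THIRTY_DAYS
    morsel["httponly"] = True           # assumption: page scripts never need the token
    morsel["secure"] = True             # assumption: the site is HTTPS-only
    morsel["samesite"] = "Lax"
    return token, morsel.OutputString()


token, header_value = issue_cart_token_cookie()
```

The cookie carries only the token. Items, quantities, and prices stay on the server, so a copied or tampered cookie can at worst point at a different cart row.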
Now a new problem: Alice builds a guest cart with 3 shoes over 20 minutes. She logs in. She already had 2 shoes saved from last week.
flowchart TD
Login([Alice logs in]) --> BothExist{Both carts exist?}
BothExist -- "No guest cart" --> UseAccount["Return account cart unchanged"]:::ok
BothExist -- "No account cart" --> Rebind["Rebind guest cart to user_id<br/>(clear cart_token)"]:::ok
BothExist -- "Both have items" --> Merge["For each SKU: take max(guest qty, account qty)<br/>Skip discontinued items<br/>Respect size limit"]:::ok
Merge --> Audit["Write carts_merged audit row<br/>Delete guest cart<br/>Clear guest cookie"]:::ok
classDef ok fill:#dcfce7,stroke:#15803d,color:#14532d
The quantity rule matters. If the guest cart has 2 of shoe-A and the account cart has 1, Alice almost certainly wants 2, not 3. Take the max, not the sum.
The merge runs inside a serializable transaction. Alice might double-click Log In. Two concurrent merge calls race. The second finds the guest cart already deleted and returns the account cart unchanged. No duplicate merge.
Show: the carts_merged audit table and merge sketch
CREATE TABLE carts_merged (
merge_id UUID PRIMARY KEY,
user_id BIGINT NOT NULL,
anonymous_token UUID,
anonymous_items JSONB NOT NULL,
account_items JSONB NOT NULL,
merged_items JSONB NOT NULL,
rule_applied TEXT NOT NULL,
trimmed_items JSONB,
occurred_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
Every merge writes a row here regardless of outcome: rebind, full merge, or no-op. When Alice emails support “my cart is wrong after I logged in,” you have the answer. The data is cheap to store. The audit is irreplaceable.
def merge_carts(anonymous_token, user_id):
    # One serializable transaction; both carts are locked while we work.
    with db.transaction(isolation="serializable"):
        anon_cart = db.fetch_cart(cart_token=anonymous_token, lock=True)
        user_cart = db.fetch_cart(user_id=user_id, lock=True)

        # No guest cart (or a concurrent merge already consumed it): no-op.
        if anon_cart is None:
            audit_merge(user_id, anonymous_token, rule="noop")
            return user_cart

        # No account cart: hand the guest cart to the user.
        if user_cart is None:
            db.update(anon_cart.id, user_id=user_id, cart_token=None)
            audit_merge(user_id, anonymous_token, rule="rebind")
            return db.fetch_cart(user_id=user_id)

        # Both exist: start from the account cart and fold guest items in.
        merged = {item.sku: item.copy() for item in user_cart.items}
        trimmed = []
        for item in anon_cart.items:
            if not catalog.is_available(item.sku):
                trimmed.append(item.sku)
                continue
            if item.sku in merged:
                # Max rule, capped at the per-item limit.
                merged[item.sku].qty = min(
                    max(item.qty, merged[item.sku].qty), MAX_QTY_PER_ITEM
                )
            else:
                if len(merged) >= MAX_CART_ITEMS:
                    trimmed.append(item.sku)
                    continue
                merged[item.sku] = item

        db.replace_items(user_cart.id, merged.values())
        db.delete(anon_cart.id)
        audit_merge(user_id, anonymous_token, rule="qty:max", trimmed=trimmed)
        return db.fetch_cart(user_id=user_id)
Take this with you. The merge on login is where most cart designs break. Max-qty rule, one serializable transaction, audit row, clear the cookie.
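The merge rule can be exercised without a database. The following is a simplified, in-memory sketch of the same rule over plain {sku: qty} dictionaries. The function merge_quantities and the available set are hypothetical stand-ins for the cart rows and the catalog check; the per-item limit mirrors the qty CHECK (99), and the 100-line cart cap is an assumed value.

```python
MAX_QTY_PER_ITEM = 99   # mirrors the CHECK on cart_items.qty
MAX_CART_ITEMS = 100    # assumed cart size limit


def merge_quantities(guest, account, available):
    """Merge two {sku: qty} dicts with the max-qty rule.

    Returns (merged, trimmed), where trimmed lists the guest SKUs that were dropped
    and why.
    """
    merged = dict(account)              # start from the account cart
    trimmed = []
    for sku, qty in guest.items():
        if sku not in available:
            trimmed.append((sku, "discontinued"))
        elif sku in merged:
            merged[sku] = min(max(qty, merged[sku]), MAX_QTY_PER_ITEM)
        elif len(merged) >= MAX_CART_ITEMS:
            trimmed.append((sku, "size_limit"))
        else:
            merged[sku] = qty
    return merged, trimmed


merged, trimmed = merge_quantities(
    guest={"shoe-A": 2, "sock-B": 1, "hat-C": 1},
    account={"shoe-A": 1, "belt-D": 1},
    available={"shoe-A", "sock-B", "belt-D"},
)
# merged  -> {"shoe-A": 2, "belt-D": 1, "sock-B": 1}
# trimmed -> [("hat-C", "discontinued")]
```

Here shoe-A ends at 2 (the max of 2 and 1, not the sum), sock-B is carried over, and the discontinued hat-C is recorded as trimmed rather than silently lost.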
Decision 2: how do we handle the inventory race?
Alice adds a shoe to her cart at 2 PM. At 2:20 PM she clicks Buy. Someone else took the last pair at 2:15 PM. Three approaches handle this. None is perfect.
flowchart TD
Add([Alice clicks Add to Cart]) --> A{Which approach?}
A --> Opt["A. Optimistic<br/>Show last-known state.<br/>Re-check at checkout."]
A --> Res["B. Soft reservation<br/>Place a 15-min hold on add.<br/>Expire if cart abandoned."]
A --> None["C. No check<br/>Accept everything.<br/>Sort it out later."]
Opt --> OptBad["1-3% of checkouts<br/>hit sold-out surprise at the last step"]:::bad
Res --> ResBad["Ghost carts hold inventory for 15 min.<br/>Hot items look sold out to real buyers."]:::bad
None --> NoneBad["'Sorry, cannot ship.<br/>Here is your refund.' email."]:::bad
classDef bad fill:#fecaca,stroke:#b91c1c,color:#7f1d1d
Show: comparison table and recommendation
| Approach | Normal case | Failure mode | Build cost | Right for |
|---|---|---|---|---|
| A. Optimistic | Works fine. Checkout re-checks. | 1-3% of checkouts find item gone at the last step. | Low. | Default for most shops. |
| B. Soft reservation | User never sees a sold-out surprise mid-checkout. | Ghost carts hold inventory for 15 min. Hot items show as sold out to real buyers. | High. Inventory needs hold, release, and TTL expiry logic. | Concert tickets, limited sneaker drops. |
| C. No check | Always accepts. Fast. | “Cannot ship, here is your refund.” | Near zero. | Pre-orders, print-on-demand. |
Default: optimistic. Reservation only for SKUs explicitly flagged requires_reservation=true.
Industry cart abandonment is 60-70%. If every add-to-cart held inventory for 15 minutes, ghost carts would make real inventory look empty. That is right for a Taylor Swift ticket sale. Wrong for shoes.
Division of responsibility:
- Cart service: read-only availability check on add. Show what we believe. No writes to inventory.
- Order service: authoritative try_reserve(sku, qty) at checkout. If it fails, no order, no charge.
The cart’s job is to show good information. The order service’s job is to make the buy real.
Take this with you. The cart shows what is probably true. The Order Service makes it actually true. Never put the inventory guarantee in the cart.
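One common way to make a reserve call authoritative is a conditional UPDATE that only succeeds while enough stock remains. The sketch below uses SQLite from the Python standard library so it runs anywhere. The inventory table and its columns are assumptions; a real Order Service would run the same statement shape against its own store.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (sku TEXT PRIMARY KEY, available INTEGER NOT NULL)")
conn.execute("INSERT INTO inventory VALUES ('shoe-blue-42', 1)")  # the last pair
conn.commit()


def try_reserve(conn, sku, qty):
    """Atomically decrement stock; return True only if enough was available."""
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            "UPDATE inventory SET available = available - ? "
            "WHERE sku = ? AND available >= ?",
            (qty, sku, qty),
        )
        # The WHERE clause does the check and the write in one statement,
        # so two buyers cannot both take the last pair.
        return cur.rowcount == 1


first = try_reserve(conn, "shoe-blue-42", 1)    # True: stock goes 1 -> 0
second = try_reserve(conn, "shoe-blue-42", 1)   # False: nothing left
```

The cart never runs this statement. It only reads availability, which is why a stale read in the cart cannot oversell.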
Decision 3: how do we make checkout idempotent?
Alice’s phone drops the connection mid-checkout. The app retries. Without idempotency, the cart service processes the checkout twice and charges her twice.
The fix is an Idempotency-Key header on every mutating request: a UUID the client generates once per logical operation. The server records the key and the result. If the same key arrives again, it returns the recorded result without re-processing.
sequenceDiagram
autonumber
participant Phone as Alice's phone
participant GW as API Gateway
participant CS as Cart Service
participant DB as Postgres
Phone->>GW: POST /checkout Idempotency-Key: abc-123
GW->>CS: forward
rect rgb(241, 245, 249)
Note over CS,DB: check-then-act, one transaction
CS->>DB: SELECT FROM idempotency_keys WHERE key='abc-123'
DB-->>CS: not found
CS->>DB: INSERT idempotency_keys (key, status='processing')
CS->>DB: frozen snapshot of cart
CS->>DB: COMMIT
end
Note over CS: hand snapshot to Order Service (async)
CS-->>Phone: 202 Accepted, order_id: ord-789
Note over Phone: network drops here
Phone->>GW: POST /checkout Idempotency-Key: abc-123 (retry)
GW->>CS: forward
CS->>DB: SELECT FROM idempotency_keys WHERE key='abc-123'
DB-->>CS: found, status='processing', order_id=ord-789
CS-->>Phone: 202 Accepted, order_id: ord-789 (same response)
The cart does not clear until the Order Service emits a cart.converted event. A payment failure emits cart.checkout_failed, releases any inventory holds, and leaves the cart intact for editing.
Take this with you. Idempotency keys plus a frozen snapshot. The key catches retries. The snapshot decouples cart state from payment outcome.
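The check-then-act step in the diagram is safest when the database itself arbitrates between two requests carrying the same key. A minimal sketch, assuming an idempotency_keys table whose key column is the primary key, and using SQLite only so the example is self-contained. The function start_checkout and the order_id format are illustrative.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE idempotency_keys ("
    " key TEXT PRIMARY KEY,"
    " status TEXT NOT NULL,"
    " order_id TEXT NOT NULL)"
)


def start_checkout(conn, idempotency_key):
    """Return (order_id, is_new). A repeated key returns the recorded order_id."""
    candidate = "ord-" + uuid.uuid4().hex[:8]
    with conn:  # one transaction: claim the key and (in a real service) freeze the cart
        cur = conn.execute(
            "INSERT OR IGNORE INTO idempotency_keys (key, status, order_id) "
            "VALUES (?, 'processing', ?)",
            (idempotency_key, candidate),
        )
        if cur.rowcount == 1:
            # First request with this key: the cart snapshot would be written here.
            return candidate, True
        # Retry: the primary key rejected the insert, so replay the stored result.
        row = conn.execute(
            "SELECT order_id FROM idempotency_keys WHERE key = ?",
            (idempotency_key,),
        ).fetchone()
        return row[0], False


order_a, new_a = start_checkout(conn, "abc-123")
order_b, new_b = start_checkout(conn, "abc-123")   # the phone's retry
```

Letting the primary key reject the duplicate insert avoids the gap between a separate SELECT and INSERT, where two simultaneous retries could both see "not found".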
Decision 4: how do we serve the cart icon fast?
The cart icon appears on every page. At 1M users, that is 350 icon reads per second. Each read only needs one number: how many items are in Alice’s cart.
The naive path is a SELECT COUNT(*) FROM cart_items WHERE cart_id = ?. At 350/sec that is an extra index scan on every page view, all to recompute a number that only changes when the cart does.
Two things fix this. First, denormalize item_count onto the carts row and keep it updated in the same transaction as item changes. Second, cache the whole cart hash in Redis so icon reads never touch Postgres.
flowchart TB
Alice([Alice]):::user --> GW["API Gateway"]:::edge
GW -->|"GET /cart/icon (~350/sec)"| CS["Cart Service"]:::app
CS -->|"HGET cart:user:{uid} item_count (~95% hit)"| R[("Redis<br/>cart:user:{uid}<br/>~5 MB for 25K active carts")]:::cache
CS -.miss.-> DB[("Postgres<br/>item_count on carts row")]:::db
R -->|"~20 ms"| GW
DB -->|"~40 ms, repopulates Redis"| GW
classDef user fill:#dbeafe,stroke:#1e40af,color:#1e3a8a
classDef edge fill:#e2e8f0,stroke:#475569,color:#1e293b
classDef app fill:#dcfce7,stroke:#15803d,color:#14532d
classDef db fill:#fed7aa,stroke:#c2410c,color:#7c2d12
classDef cache fill:#fecaca,stroke:#b91c1c,color:#7f1d1d
Redis holds the compact cart: SKU, qty, snapshot price. One HGET returns the count. Catalog and inventory results are not stored in Redis; they change too fast and are fetched fresh on each full cart page load.
Take this with you. Denormalize item_count. Cache it in Redis. One field read, no JOIN, no catalog call. That is how the icon stays under 20 ms.
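The read path described here is a cache-aside lookup. The sketch below shows its shape with an in-memory stand-in for Redis so that it runs without a server. FakeRedis, icon_count, and load_count_from_db are hypothetical names; only the key layout cart:user:{uid} and the item_count field come from the design above.

```python
class FakeRedis:
    """In-memory stand-in for the two Redis hash commands used in this sketch."""

    def __init__(self):
        self._hashes = {}

    def hget(self, key, field):
        return self._hashes.get(key, {}).get(field)

    def hset(self, key, field, value):
        self._hashes.setdefault(key, {})[field] = value


def icon_count(cache, load_count_from_db, user_id):
    """Cache-aside read of item_count: try the cache, fall back to the DB, repopulate."""
    key = f"cart:user:{user_id}"
    cached = cache.hget(key, "item_count")
    if cached is not None:
        return int(cached)
    count = load_count_from_db(user_id)     # one-row read of carts.item_count
    cache.hset(key, "item_count", count)    # next read is served from the cache
    return count


db_reads = []


def load_count_from_db(user_id):
    db_reads.append(user_id)                # record every database hit for the demo
    return 3


cache = FakeRedis()
first = icon_count(cache, load_count_from_db, 42)    # miss: reads the database
second = icon_count(cache, load_count_from_db, 42)   # hit: served from the cache
```

The write path has to update or invalidate the cached item_count whenever the cart changes, otherwise the icon shows a stale number until the entry is evicted.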
Decision 5: how do we clean up abandoned carts?
60-70% of carts are never purchased. Guest carts accumulate with no user to notify. At 1M users, that is 180,000 abandoned carts per day with no automatic cleanup.
Two separate problems: finding carts to email, and deleting dead rows.
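Finding carts to email: a job that runs every 15 minutes queries carts where status = 'active' and updated_at falls between 6 hours and 6 hours 15 minutes ago. Each run sees only the carts that crossed the 6-hour mark since the previous run.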
flowchart LR
Job(["Abandonment job<br/>(runs every 15 min)"]):::app
DB[("Postgres<br/>idx on status, updated_at")]:::db
K{{"Kafka<br/>cart.abandoned"}}:::queue
Email["Notification Service<br/>(email)"]:::app
Dedup[("cart_abandonment_emails<br/>(dedup table)")]:::db
Job -->|"SELECT carts where updated_at in [6h, 6h15m) ago"| DB
DB -->|"matching cart_ids"| Job
Job -->|"emit cart.abandoned"| K
K --> Email
Email --> Dedup
classDef app fill:#dcfce7,stroke:#15803d,color:#14532d
classDef db fill:#fed7aa,stroke:#c2410c,color:#7c2d12
classDef queue fill:#ddd6fe,stroke:#6d28d9,color:#4c1d95
The narrow time window means the query only touches carts that just crossed the threshold. A partial index on updated_at, restricted to status = 'active', keeps it fast regardless of total cart count.
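To see why consecutive runs neither overlap nor leave gaps, it helps to compute the window bounds explicitly. This is a small sketch, assuming the job fires exactly on its 15-minute schedule; abandonment_window and is_due are hypothetical helper names.

```python
from datetime import datetime, timedelta, timezone

THRESHOLD = timedelta(hours=6)      # a cart counts as abandoned after 6 idle hours
INTERVAL = timedelta(minutes=15)    # how often the job runs


def abandonment_window(run_at):
    """Return (start, end): carts with start < updated_at <= end are between
    6h and 6h15m old at run_at."""
    end = run_at - THRESHOLD
    start = end - INTERVAL
    return start, end


def is_due(updated_at, run_at):
    start, end = abandonment_window(run_at)
    return start < updated_at <= end


run1 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
run2 = run1 + INTERVAL
last_activity = datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc)  # exactly 6h before run1

picked_by_run1 = is_due(last_activity, run1)   # True
picked_by_run2 = is_due(last_activity, run2)   # False: already handled by run1
```

If a run is skipped or delayed, its window is never queried, so a production job would typically persist the last processed boundary instead of deriving it from the clock.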
Deleting dead rows: guest carts get expires_at = NOW() + 30 days, refreshed on every activity. A nightly GC job deletes rows where expires_at < NOW() and status != 'converted'. Converted carts stay for the order audit trail.
Take this with you. Narrow time windows, not full scans. Emit events to Kafka, do not email directly from the job. Dedup on delivery, not on send.
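Because Kafka consumers can see the same event more than once, the dedup step belongs in the consumer. A minimal sketch of that idea, assuming one email per cart and a cart_abandonment_emails table keyed by cart_id; the schema, handle_cart_abandoned, and the send_email callback are all illustrative, and SQLite is used only to keep the example self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cart_abandonment_emails ("
    " cart_id TEXT PRIMARY KEY,"
    " sent_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
)


def handle_cart_abandoned(conn, cart_id, send_email):
    """Handle one cart.abandoned event; redelivered events become no-ops."""
    with conn:  # the claim is rolled back if send_email raises
        cur = conn.execute(
            "INSERT OR IGNORE INTO cart_abandonment_emails (cart_id) VALUES (?)",
            (cart_id,),
        )
        if cur.rowcount == 0:
            return False            # this cart was already emailed
        send_email(cart_id)
        return True


outbox = []
handled_first = handle_cart_abandoned(conn, "cart-1", outbox.append)
handled_again = handle_cart_abandoned(conn, "cart-1", outbox.append)  # duplicate delivery
```

A crash after the email goes out but before the commit can still produce a second email on redelivery, so this gives at-least-once sending with duplicates made rare, not impossible.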
The full architecture
Putting the five decisions together.
flowchart TB
subgraph Edge["Client edge"]
C([Web / Mobile]):::user
GW["API Gateway<br/>(auth · cart_token · rate limit)"]:::edge
end
subgraph WritePath["Synchronous write path"]
CS["Cart Service<br/>(stateless pods)"]:::app
Cat["Catalog Service"]:::ext
Inv["Inventory Service"]:::ext
end
DB[("Postgres<br/>carts · cart_items · carts_merged")]:::db
R[("Redis<br/>cart:user:{uid}<br/>~5 MB active")]:::cache
K{{"Kafka<br/>cart.item.added · cart.merged<br/>cart.abandoned · cart.converted"}}:::queue
subgraph Consumers["Async consumers"]
AB["Abandoned cart emails"]:::app
AN[("Analytics<br/>(ClickHouse)")]:::db
FR["Fraud check"]:::app
ORD["Order Service"]:::app
end
C --> GW
GW --> CS
CS --> Cat
CS --> Inv
CS --> R
CS --> DB
DB -->|CDC / outbox| K
K --> AB
K --> AN
K --> FR
K --> ORD
classDef user fill:#dbeafe,stroke:#1e40af,color:#1e3a8a
classDef edge fill:#e2e8f0,stroke:#475569,color:#1e293b
classDef app fill:#dcfce7,stroke:#15803d,color:#14532d
classDef db fill:#fed7aa,stroke:#c2410c,color:#7c2d12
classDef cache fill:#fecaca,stroke:#b91c1c,color:#7f1d1d
classDef queue fill:#ddd6fe,stroke:#6d28d9,color:#4c1d95
classDef ext fill:#e9d5ff,stroke:#7e22ce,color:#581c87
Each component in one sentence:
| Component | Purpose |
|---|---|
| API Gateway | Auth, cart_token cookie issuance, rate limiting per IP and user. |
| Cart Service | Stateless. Owns merge logic, size limits, price snapshot, idempotency check. |
| Catalog Service | Name, image, current price per SKU. Called on cart page load, in parallel with Inventory. |
| Inventory Service | Stock availability. Cart reads it. Never writes to it. |
| Postgres | Source of truth. Three tables: carts, cart_items, carts_merged. |
| Redis | Fast cache for active carts. Icon read lives here (~5 MB for 25K active carts). |
| Kafka | Carries cart events to downstream teams. Abandonment, analytics, fraud, order service. |
| Order Service | Authoritative inventory reserve + payment. Cart does not clear until Order Service confirms. |
| Analytics, Fraud, Emails | Downstream consumers. If any dies, cart adds and reads still work. |
Walk: add to cart, end to end
Alice adds a shoe on her laptop.
sequenceDiagram
autonumber
participant Alice
participant GW as API Gateway
participant CS as Cart Service
participant Cat as Catalog
participant Inv as Inventory
participant DB as Postgres
participant R as Redis
participant K as Kafka
Alice->>GW: POST /cart/items {sku: shoe-blue-42, qty: 1}<br/>Idempotency-Key: f3a1-...
GW->>CS: forward (auth ok, key not seen before)
CS->>Cat: price + name for shoe-blue-42?
CS->>Inv: is shoe-blue-42 in stock?
Note over CS,Inv: parallel calls (~30 ms each)
Cat-->>CS: $79, Blue Runner Size 42
Inv-->>CS: in stock
rect rgb(241, 245, 249)
Note over CS,DB: one transaction
CS->>DB: INSERT cart_items ON CONFLICT DO UPDATE qty
CS->>DB: UPDATE carts SET item_count = item_count + 1, updated_at = NOW()
CS->>DB: COMMIT
end
CS->>R: HSET cart:user:42 shoe-blue-42 {qty:1, price:7900} item_count 1 (~2 ms)
CS->>K: emit cart.item.added (fire and forget)
CS-->>Alice: 201 Created (~80 ms total)
Three details worth noting:
- Catalog and Inventory are called in parallel. Total latency is max(catalog, inventory), not the sum.
- The DB write and the item_count update are one transaction. A crash mid-write rolls back cleanly.
- Redis is written after the commit. If Redis fails, Postgres has the truth and repopulates on the next read.
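The parallel lookup in the first detail can be sketched with asyncio.gather. The two coroutines below are hypothetical stand-ins for the Catalog and Inventory clients, and the sleeps are lengthened to 0.2 s (instead of ~30 ms) so the timing difference is easy to observe.

```python
import asyncio
import time


async def fetch_catalog(sku):
    await asyncio.sleep(0.2)            # stand-in for the Catalog network call
    return {"sku": sku, "name": "Blue Runner Size 42", "price_cents": 7900}


async def fetch_inventory(sku):
    await asyncio.sleep(0.2)            # stand-in for the Inventory network call
    return {"sku": sku, "in_stock": True}


async def lookup(sku):
    """Run both calls concurrently; total latency is the slower call, not the sum."""
    return await asyncio.gather(fetch_catalog(sku), fetch_inventory(sku))


started = time.perf_counter()
catalog_info, stock_info = asyncio.run(lookup("shoe-blue-42"))
elapsed = time.perf_counter() - started   # close to 0.2 s, not 0.4 s
```

asyncio.gather returns results in the order the coroutines were passed, so the unpacking above is stable regardless of which call finishes first.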
Walk: cross-device sync
Alice adds shoes on her laptop at 2 PM. She opens her phone at 3 PM.
sequenceDiagram
autonumber
participant Laptop as Alice's laptop
participant Phone as Alice's phone
participant CS as Cart Service
participant R as Redis
participant DB as Postgres
Laptop->>CS: POST /cart/items {sku: shoe-blue-42}
CS->>DB: INSERT cart_items, UPDATE item_count
CS->>R: HSET cart:user:42 shoe-blue-42 {qty:1} item_count 1
CS-->>Laptop: 201 Created
Note over Phone: 1 hour later
Phone->>CS: GET /cart (Authorization: Bearer <token>)
CS->>R: HGETALL cart:user:42
R-->>CS: shoe-blue-42 {qty:1, price:7900} (~5 ms)
CS->>CS: hydrate with catalog names and live availability
CS-->>Phone: 200 {items: [{sku: shoe-blue-42, name: Blue Runner, qty: 1, price: $79}]}
The cart is keyed by user_id, not by device. Any device with a valid session token reads the same cart. No sync protocol needed.
The hard sub-problem: inventory hold timing
When a soft reservation is used (flagged SKUs like limited drops), the hold timer creates a cascading problem.
flowchart TD
Add([Alice adds shoe to cart<br/>Inventory hold placed: 15 min]) --> T1{15 min pass<br/>without checkout}
T1 -- "Hold expires" --> Released["Inventory released<br/>Another buyer can take it"]:::ok
T1 -- "Alice checks out at 14:59" --> HoldActive{Hold still valid?}
HoldActive -- Yes --> Convert["Order Service converts hold to purchase"]:::ok
HoldActive -- "Race: hold just expired" --> SoldOut["Checkout fails: sold out<br/>Show error to Alice"]:::bad
Released --> AnotherBuyer["Bob adds shoe<br/>Gets a fresh 15-min hold"]:::ok
AnotherBuyer --> BobChecksOut["Bob checks out<br/>Hold converts to purchase"]:::ok
classDef ok fill:#dcfce7,stroke:#15803d,color:#14532d
classDef bad fill:#fecaca,stroke:#b91c1c,color:#7f1d1d
The race at the bottom (hold just expired while checkout is in flight) is unavoidable. The mitigation is a grace window: the Order Service attempts the reserve even if the hold shows as expired within the last 30 seconds. If the inventory count still has availability, the purchase goes through.
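The grace-window rule reduces to a small predicate. The sketch below is one possible reading of it, with hypothetical names (checkout_allowed, available_qty, wanted_qty). In a real Order Service the availability check and the reserve would be a single atomic step instead of a value passed in.

```python
from datetime import datetime, timedelta, timezone

GRACE = timedelta(seconds=30)


def checkout_allowed(hold_expires_at, now, available_qty, wanted_qty):
    """Decide whether a checkout may proceed for a SKU that uses holds."""
    if now <= hold_expires_at:
        return True                     # hold still valid: stock is already set aside
    within_grace = now - hold_expires_at <= GRACE
    # Hold just lapsed: proceed only if the stock was not taken in the meantime.
    return within_grace and available_qty >= wanted_qty


expires = datetime(2024, 1, 1, 12, 15, tzinfo=timezone.utc)
ten_seconds_late = expires + timedelta(seconds=10)

ok_in_grace = checkout_allowed(expires, ten_seconds_late, available_qty=1, wanted_qty=1)
sold_out = checkout_allowed(expires, ten_seconds_late, available_qty=0, wanted_qty=1)
too_late = checkout_allowed(expires, expires + timedelta(seconds=31), available_qty=5, wanted_qty=1)
```

Past the grace window the request is treated like any other fresh purchase attempt, so it fails here even though stock exists. Whether to fall back to an ordinary optimistic reserve in that case is a product decision this sketch does not make.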
| Hold timer | Trade-off |
|---|---|
| 5 minutes | Fewer ghost holds. More sold-out surprises at checkout. |
| 15 minutes | Standard for limited drops. Significant ghost hold problem at high abandonment. |
| 30 minutes | Feels safe for the user. Ghost holds lock out real buyers for half an hour. |
For most shops: skip holds entirely. Use optimistic inventory. Reserve only at checkout, inside the Order Service, as an atomic step with payment.
Take this with you. Hold timers solve one problem (surprise at checkout) and create another (ghost holds blocking real buyers). Keep them short and scope them to SKUs that genuinely need them.
Follow-up questions
Try answering each in 2 or 3 sentences before opening the solution.
Bots stuff a cart with 10,000 items. What goes wrong? How do you stop it?
Phone-to-laptop sync delay. Alice adds a shoe on her phone. She opens her laptop 5 seconds later. The cart shows the old state. How long is acceptable? How do you fix it?
Redis goes down mid-day. All active carts are cached in Redis. What does Alice see? How do you recover without losing any carts?
Price went up. Alice added a shoe at $79 last week. Today it is $89. What does she pay? What does she see at checkout?
Abandoned cart emails. You want to email shoppers 6 hours after their last activity. How do you find those carts without scanning every active cart every minute?
Anonymous carts pile up. When do you delete them? What happens if a user returns after 90 days with the same old cookie?
Two people share one account. Both log in from different cities and add items at the same time. What happens?
Currency switch. Alice adds a shoe priced in USD. She switches the site to EUR. What happens to the snapshot price?
Item becomes restricted. Alice added a legal item. A new regulation restricts shipping it to her state. She goes to checkout. What does the system do?
Save for later. Alice wants to move an item from her cart to a wishlist. Is this the cart service’s job? Where does the wishlist live?
Checkout race. Two sessions for the same user both hit checkout within 100 ms of each other (browser tab dupe, mobile + desktop). What prevents a double order?
Cart grows to 200 items. When do you enforce a size limit? Where does the limit live?
Related problems
- Approval Management (011). Same patterns: state per user, event stream on changes, audit table on transitions.
- Coupon Redemption (014). The cart holds a coupon code. The coupon service decides validity. Same service boundary as inventory.
- Read-Heavy System Patterns (017). The cart icon read on every page is a classic read-heavy load. The Redis-plus-DB pattern applies directly.
- Write-Heavy System Patterns (018). The Kafka event stream for analytics is the write-heavy pattern at scale.
- Help Desk Ticketing (019). “My cart is wrong after login” support tickets need the carts_merged audit table to answer.
Try the problem on your own first. Solutions are most valuable after you've struggled with it.
Solution: Shopping Cart Service
The short version
A shopping cart is a small amount of mutable state per user: add items, change quantities, buy or leave. The state machine is trivial. What makes the design interesting is five specific problems: cross-device sync, guest-to-user merge at login, the inventory race at checkout, idempotent checkout under retries, and cleaning up abandoned carts without blocking the write path.
The answers: Postgres as source of truth, Redis as a fast read layer for icon counts, Kafka for downstream teams, and a careful merge algorithm on login. At 1M users, cart writes are only ~21/sec. The real load is the 350 icon reads per second that appear on every page of the site.
1. The two questions that matter most
Guests or login only? If guests can add without an account, you need a cart_token cookie, a server-side guest cart, a merge endpoint at login, and an audit table for what changed. This is the harder path and the right one for almost every real shop.
What does “in stock” mean? Optimistic (show last-known, re-check at checkout), soft reservation (hold inventory on add), or no check? This single decision changes how inventory and checkout interact. The right default is optimistic, with soft holds only for explicitly flagged SKUs.
Everything else follows from these two.
2. The math
| Scale | Carts/day | Writes/sec (peak) | Icon reads/sec (peak) | Active carts | Live storage |
|---|---|---|---|---|---|
| Small (500 DAU) | 150 | 0.01 | 0.06 | ~50 | ~33 MB/year |
| Big (1M DAU) | 300,000 | 21 | 350 | ~25,000 | ~7 GB |
What the numbers say:
- Writes are tiny even at 1M users. A single Postgres handles 21 writes/sec without effort.
- The icon read is the load to optimize: 350/sec with a target under 20 ms. That pushes you to Redis and denormalized item_count.
- 25,000 active carts fit in Redis as compact hashes. About 5 MB total. Trivial.
- Beyond the icon read, the pressure point is not the cart writes. It is the Inventory service, which is called on every full cart page load.
3. The API
Five endpoints carry the whole product.
GET /api/v1/cart
POST /api/v1/cart/items Idempotency-Key: <uuid>
PATCH /api/v1/cart/items/{sku} qty: 0 means remove
DELETE /api/v1/cart/items/{sku}
POST /api/v1/cart/merge body: {"anonymous_token": "<uuid>"}
GET /cart returns a hydrated response: SKU + qty from Postgres/Redis, joined with name, image, and current price from Catalog, and availability from Inventory. The join happens on the server. Never push it to the browser.
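The server-side join can be pictured as a plain function over three inputs. This is a simplified sketch: hydrate_cart, the dictionary shapes, and the price_changed flag are assumptions for illustration, and in production the catalog and inventory arguments would come from service calls.

```python
def hydrate_cart(cart_items, catalog, inventory):
    """Join stored cart lines with live catalog and inventory data, server-side."""
    hydrated = []
    for line in cart_items:
        product = catalog[line["sku"]]
        hydrated.append({
            "sku": line["sku"],
            "qty": line["qty"],
            "name": product["name"],
            "snapshot_price_cents": line["snapshot_price_cents"],   # what the user saw on add
            "current_price_cents": product["price_cents"],          # what the user pays
            "price_changed": product["price_cents"] != line["snapshot_price_cents"],
            "available": inventory.get(line["sku"], 0) >= line["qty"],
        })
    return hydrated


response_items = hydrate_cart(
    cart_items=[{"sku": "shoe-blue-42", "qty": 1, "snapshot_price_cents": 7900}],
    catalog={"shoe-blue-42": {"name": "Blue Runner Size 42", "price_cents": 8900}},
    inventory={"shoe-blue-42": 5},
)
```

Returning both prices lets the client show a "price changed" notice without a second round trip.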
| Status code | Meaning |
|---|---|
| 201 | Item added |
| 200 | Item already present, quantity updated |
| 400 | Quantity out of range or bad SKU |
| 404 | SKU does not exist |
| 409 | SKU is restricted (region, age gate) |
| 410 | SKU is discontinued |
| 422 | Cart is full (100-item limit) |
Load-bearing choices:
- Idempotency-Key required on writes. A phone retries on flaky Wi-Fi. Without the key, a dropped connection gives qty 2 when the user wanted 1, or two checkout attempts.
- Snapshot price and current price both return on every read. Snapshot is what Alice saw when she added. Current is what she pays. Show both. Audit needs both.
- Checkout returns a session token, not an order. The cart does not clear until the Order Service confirms payment. Payment failure leaves the cart intact.
4. The data model
Three tables: two for live data, one for audit.
erDiagram
carts ||--o{ cart_items : contains
carts ||--o{ carts_merged : "audit on merge"
carts {
uuid cart_id PK
bigint user_id
uuid cart_token
text status
int item_count
timestamptz updated_at
timestamptz expires_at
}
cart_items {
uuid cart_id FK
text sku
int qty
int snapshot_price_cents
text hold_token
timestamptz added_at
}
carts_merged {
uuid merge_id PK
bigint user_id
uuid anonymous_token
jsonb anonymous_items
jsonb account_items
jsonb merged_items
text rule_applied
jsonb trimmed_items
timestamptz occurred_at
}
Show: the full SQL
CREATE TABLE carts (
cart_id UUID PRIMARY KEY,
user_id BIGINT,
cart_token UUID,
status TEXT NOT NULL DEFAULT 'active',
item_count INT NOT NULL DEFAULT 0,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
expires_at TIMESTAMPTZ,
CHECK ((user_id IS NULL) <> (cart_token IS NULL))
);
CREATE UNIQUE INDEX idx_carts_user
ON carts (user_id) WHERE status = 'active' AND user_id IS NOT NULL;
CREATE UNIQUE INDEX idx_carts_token
ON carts (cart_token) WHERE status = 'active' AND cart_token IS NOT NULL;
CREATE INDEX idx_carts_abandonment
ON carts (updated_at) WHERE status = 'active';
CREATE TABLE cart_items (
cart_id UUID NOT NULL REFERENCES carts(cart_id) ON DELETE CASCADE,
sku TEXT NOT NULL,
qty INT NOT NULL CHECK (qty > 0 AND qty <= 99),
snapshot_price_cents INT NOT NULL,
added_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
hold_token TEXT,
PRIMARY KEY (cart_id, sku)
);
CREATE INDEX idx_cart_items_sku ON cart_items (sku);
CREATE TABLE carts_merged (
merge_id UUID PRIMARY KEY,
user_id BIGINT NOT NULL,
anonymous_token UUID,
anonymous_items JSONB NOT NULL,
account_items JSONB NOT NULL,
merged_items JSONB NOT NULL,
rule_applied TEXT NOT NULL,
trimmed_items JSONB,
occurred_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_merged_user ON carts_merged (user_id, occurred_at DESC);
Four choices worth defending:
The CHECK constraint enforces that exactly one of user_id or cart_token is set. A cart is either owned or guest. After merge, the guest row is deleted and the constraint ensures no cart becomes orphaned.
item_count is denormalized. The cart icon on every page needs one number: one Redis field read, no JOIN, no catalog call. Updated in the same transaction as item changes so it never goes stale.
snapshot_price_cents on cart_items. Records the price when Alice added the item. The total is computed fresh at checkout from current prices. The snapshot is for display and audit.
carts_merged has no business logic. It is a record of what happened. Every merge writes a row: what was in each cart, what the rule was, what was trimmed. When a user emails support, you have the answer.
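Why Postgres and not DynamoDB? Merging two carts atomically is one transaction. Idempotency key checks are one transaction. The data is small (7 GB at 1M users). Postgres gives ACID in one box. DynamoDB does offer transactional writes, but with a cap on items per transaction and condition-expression bookkeeping, the read-compare-rewrite merge takes noticeably more application code than one SQL transaction.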
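The exactly-one-owner CHECK on carts is easy to verify in isolation. The sketch below reproduces only that constraint in SQLite (not Postgres) with a trimmed-down carts table, so column types differ from the real schema. The helper can_insert is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE carts ("
    " cart_id TEXT PRIMARY KEY,"
    " user_id INTEGER,"
    " cart_token TEXT,"
    " CHECK ((user_id IS NULL) <> (cart_token IS NULL)))"
)


def can_insert(cart_id, user_id, cart_token):
    """Try to insert a cart row; report whether the CHECK constraint accepted it."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO carts VALUES (?, ?, ?)", (cart_id, user_id, cart_token)
            )
        return True
    except sqlite3.IntegrityError:
        return False


results = {
    "account cart": can_insert("c1", 42, None),       # accepted
    "guest cart": can_insert("c2", None, "tok-1"),    # accepted
    "neither set": can_insert("c3", None, None),      # rejected
    "both set": can_insert("c4", 42, "tok-2"),        # rejected
}
```

The rebind path in the merge depends on this: setting user_id and clearing cart_token must happen in the same UPDATE, or the row would briefly violate the constraint.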
5. The merge algorithm
This is where most cart designs break.
flowchart TD
Start([POST /cart/merge called]) --> FetchBoth["Fetch both carts with SELECT FOR UPDATE"]
FetchBoth --> AnonExists{Guest cart exists?}
AnonExists -- No --> ReturnUser["Return account cart unchanged"]:::ok
AnonExists -- Yes --> UserExists{Account cart exists?}
UserExists -- No --> Rebind["Rebind guest cart to user_id<br/>(clear cart_token)"]:::ok
UserExists -- Yes --> MergeItems["For each guest SKU:<br/>skip discontinued<br/>take max(guest qty, account qty)<br/>skip if over size limit"]
MergeItems --> Commit["Replace account cart items<br/>Delete guest cart<br/>Write carts_merged row<br/>Invalidate Redis for both keys<br/>Clear cart_token cookie"]:::ok
classDef ok fill:#dcfce7,stroke:#15803d,color:#14532d
Show: the merge in code
def merge_carts(anonymous_token, user_id):
    # One serializable transaction; both carts are locked (SELECT ... FOR UPDATE).
    with db.transaction(isolation="serializable"):
        anon_cart = db.fetch_cart(cart_token=anonymous_token, lock=True)
        user_cart = db.fetch_cart(user_id=user_id, lock=True)

        # No guest cart (or a concurrent merge already consumed it): no-op.
        if anon_cart is None:
            audit_merge(user_id, anonymous_token, rule="noop")
            return user_cart

        # No account cart: hand the guest cart to the user.
        if user_cart is None:
            db.update(anon_cart.id, user_id=user_id, cart_token=None)
            audit_merge(user_id, anonymous_token, rule="rebind")
            invalidate_redis(anonymous_token, user_id)
            return db.fetch_cart(user_id=user_id)

        # Both exist: start from the account cart and fold guest items in.
        merged = {item.sku: item.copy() for item in user_cart.items}
        trimmed = []
        for item in anon_cart.items:
            if not catalog.is_available(item.sku):
                trimmed.append({"sku": item.sku, "reason": "discontinued"})
                continue
            if item.sku in merged:
                # Max rule, capped at the per-item limit.
                merged[item.sku].qty = min(
                    max(item.qty, merged[item.sku].qty), MAX_QTY_PER_ITEM
                )
            else:
                if len(merged) >= MAX_CART_ITEMS:
                    trimmed.append({"sku": item.sku, "reason": "size_limit"})
                    continue
                merged[item.sku] = item

        db.replace_items(user_cart.id, merged.values())
        db.delete(anon_cart.id)
        audit_merge(user_id, anonymous_token, rule="qty:max", trimmed=trimmed)
        # Drop the cached copies of both carts so the next read reloads from Postgres.
        invalidate_redis(anonymous_token, user_id)
        return db.fetch_cart(user_id=user_id)
Three things make this safe:
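- Serializable isolation with row locks. Alice double-clicks Log In. Two merge calls race. The second waits on the row lock, or fails with a serialization error and is retried. Either way it then finds the guest cart deleted and returns the account cart unchanged. No duplicate merge.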
- Audit always written. Storage is cheap. Support tickets are not.
- Cookie cleared in the response. Set-Cookie: cart_token=; Max-Age=0. No re-merge on the next page load.
The classic mistake: doing the merge in the browser. The browser does not know the account cart, cannot enforce limits, and cannot run a transaction. Always server-side.
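To see the max-qty rule on concrete numbers, here is a minimal in-memory sketch of just the item-merging step, with no database or locking. The merge_items function, the plain {sku: qty} dicts, and every SKU except shoe-blue-42 are illustrative assumptions; the limits are the 100-item and 99-per-SKU caps used elsewhere in this design.

```python
MAX_QTY_PER_ITEM = 99
MAX_CART_ITEMS = 100

def merge_items(account, guest, discontinued=frozenset()):
    """Merge two {sku: qty} dicts with the max-qty rule.

    Returns (merged, trimmed), where trimmed lists what was dropped and why.
    """
    merged = dict(account)
    trimmed = []
    for sku, qty in guest.items():
        if sku in discontinued:
            trimmed.append({"sku": sku, "reason": "discontinued"})
        elif sku in merged:
            # Same SKU in both carts: keep the larger quantity, never the sum.
            merged[sku] = min(max(qty, merged[sku]), MAX_QTY_PER_ITEM)
        elif len(merged) >= MAX_CART_ITEMS:
            trimmed.append({"sku": sku, "reason": "size_limit"})
        else:
            merged[sku] = min(qty, MAX_QTY_PER_ITEM)
    return merged, trimmed

# Alice had 2 SKUs saved, then added 3 SKUs as a guest before logging in.
account = {"shoe-blue-42": 1, "sock-white": 2}
guest = {"shoe-blue-42": 3, "hat-red": 1, "belt-old": 1}
merged, trimmed = merge_items(account, guest, discontinued={"belt-old"})
```

The shoe ends up at quantity 3 (the max of 1 and 3, not 4), the hat is added, and the discontinued belt is recorded in trimmed so the audit row can explain why it vanished.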
6. The architecture
flowchart TB
subgraph Edge["Client edge"]
C([Web / Mobile]):::user
GW["API Gateway<br/>(auth · cart_token · rate limit)"]:::edge
end
subgraph WritePath["Synchronous write path"]
CS["Cart Service<br/>(stateless pods)"]:::app
Cat["Catalog Service"]:::ext
Inv["Inventory Service"]:::ext
end
DB[("Postgres<br/>carts · cart_items · carts_merged")]:::db
R[("Redis<br/>cart:user:{uid}<br/>~5 MB active")]:::cache
K{{"Kafka<br/>cart.item.added · cart.merged<br/>cart.abandoned · cart.converted"}}:::queue
subgraph Consumers["Async consumers"]
AB["Abandoned cart emails"]:::app
AN[("Analytics<br/>(ClickHouse)")]:::db
FR["Fraud check"]:::app
ORD["Order Service"]:::app
end
C --> GW
GW --> CS
CS --> Cat
CS --> Inv
CS --> R
CS --> DB
DB -->|CDC / outbox| K
K --> AB
K --> AN
K --> FR
K --> ORD
classDef user fill:#dbeafe,stroke:#1e40af,color:#1e3a8a
classDef edge fill:#e2e8f0,stroke:#475569,color:#1e293b
classDef app fill:#dcfce7,stroke:#15803d,color:#14532d
classDef db fill:#fed7aa,stroke:#c2410c,color:#7c2d12
classDef cache fill:#fecaca,stroke:#b91c1c,color:#7f1d1d
classDef queue fill:#ddd6fe,stroke:#6d28d9,color:#4c1d95
classDef ext fill:#e9d5ff,stroke:#7e22ce,color:#581c87
Five things to notice:
- By default, the cart service never writes to inventory. It reads. The one exception is the TTL hold for flagged SKUs described in section 8. The inventory decrease happens at checkout in the Order Service. An inventory outage does not break add-to-cart.
- Catalog and inventory are called in parallel on cart read. Latency is max(catalog, inventory), not the sum.
- Redis holds the compact cart (SKU + qty + snapshot price). Catalog and inventory results are not cached there; they change too fast.
- Postgres is source of truth. Redis is an accelerator. If Redis loses data, Postgres repopulates it on the next cache miss.
- Notifications, analytics, and fraud sit downstream of Kafka. If any of them dies, carts still work.
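The parallel read and the soft inventory dependency can be sketched together with asyncio. This is an illustration under assumptions: fetch_catalog and fetch_inventory are stand-ins that sleep instead of making HTTP calls, and the 0.25 s timeout and the fallback payload are made-up values, not part of the design above.

```python
import asyncio

async def fetch_catalog(sku):
    await asyncio.sleep(0.01)  # stand-in for an HTTP call to the Catalog Service
    return {"name": "Blue Runner Size 42", "price_cents": 7900}

async def fetch_inventory(sku, delay_s):
    await asyncio.sleep(delay_s)  # stand-in for an HTTP call to the Inventory Service
    return {"in_stock": True}

async def inventory_or_fallback(sku, delay_s, timeout_s):
    try:
        return await asyncio.wait_for(fetch_inventory(sku, delay_s), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Inventory is slow or down: degrade instead of failing the cart read.
        return {"in_stock": None, "badge": "confirm at checkout"}

async def hydrate(sku, inventory_delay_s=0.01, timeout_s=0.25):
    # gather() runs both awaitables concurrently, so the wall-clock cost is
    # roughly max(catalog, inventory), not their sum.
    catalog, inventory = await asyncio.gather(
        fetch_catalog(sku),
        inventory_or_fallback(sku, inventory_delay_s, timeout_s),
    )
    return {"sku": sku, **catalog, **inventory}

healthy = asyncio.run(hydrate("shoe-blue-42"))
degraded = asyncio.run(hydrate("shoe-blue-42", inventory_delay_s=1.0))
```

In the degraded call the catalog data still arrives; only the availability field is replaced by the badge, which is the behaviour the reliability section relies on.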
7. A request, end to end
sequenceDiagram
autonumber
participant Alice
participant GW as API Gateway
participant CS as Cart Service
participant Cat as Catalog
participant Inv as Inventory
participant DB as Postgres
participant R as Redis
participant K as Kafka
Alice->>GW: POST /cart/items {sku: shoe-blue-42, qty: 1}<br/>Idempotency-Key: f3a1-...
GW->>CS: forward (auth ok, key not seen before)
CS->>Cat: price + name?
CS->>Inv: in stock?
Note over CS,Inv: parallel calls (~30 ms)
Cat-->>CS: $79, Blue Runner Size 42
Inv-->>CS: in stock
rect rgb(241, 245, 249)
Note over CS,DB: one transaction
CS->>DB: INSERT cart_items ON CONFLICT DO UPDATE qty
CS->>DB: UPDATE carts SET item_count = item_count + 1
CS->>DB: COMMIT
end
CS->>R: HSET cart:user:42 shoe-blue-42 {qty:1, price:7900} (~2 ms)
CS->>K: emit cart.item.added (fire and forget)
CS-->>Alice: 201 Created
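The "one transaction" box in the diagram can be sketched as follows. This is a minimal illustration using sqlite3 (which supports the same ON CONFLICT ... DO UPDATE form as Postgres); the two-table schema is trimmed down, add_item is a hypothetical helper, and it assumes item_count tracks total units rather than distinct SKUs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE carts (
        cart_id    INTEGER PRIMARY KEY,
        item_count INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE cart_items (
        cart_id              INTEGER NOT NULL,
        sku                  TEXT    NOT NULL,
        qty                  INTEGER NOT NULL,
        snapshot_price_cents INTEGER NOT NULL,
        PRIMARY KEY (cart_id, sku)
    );
    INSERT INTO carts (cart_id) VALUES (1);
""")

def add_item(cart_id, sku, qty, price_cents):
    with conn:  # one transaction: both statements commit, or neither does
        conn.execute(
            """INSERT INTO cart_items (cart_id, sku, qty, snapshot_price_cents)
               VALUES (?, ?, ?, ?)
               ON CONFLICT (cart_id, sku) DO UPDATE SET qty = qty + excluded.qty""",
            (cart_id, sku, qty, price_cents),
        )
        # Assumption: item_count is the total number of units in the cart.
        conn.execute(
            "UPDATE carts SET item_count = item_count + ? WHERE cart_id = ?",
            (qty, cart_id),
        )

add_item(1, "shoe-blue-42", 1, 7900)
add_item(1, "shoe-blue-42", 2, 7900)  # same SKU again: the upsert sums the quantity
```

Because the upsert and the counter update share a transaction, the cart icon number cannot drift away from the item rows even if a request fails halfway.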
Target latencies:
| Operation | P99 target |
|---|---|
| Cart icon count (Redis hit) | ~20 ms |
| Full cart read (parallel catalog + inventory) | ~80 ms |
| Add item (inventory round-trip is the bottleneck) | ~150 ms |
| Merge on login | ~200 ms |
8. Inventory strategy
Three options. The right default is optimistic.
| Option | Failure mode | Build cost | Use when |
|---|---|---|---|
| Optimistic (re-check at checkout) | 1-3% of checkouts find item gone last-second | Low | Default for most shops |
| Soft reservation (hold on add, TTL expiry) | Ghost carts make real stock look empty | High | Concert tickets, limited drops |
| No check (accept all, sort later) | “Cannot ship, refund coming” email | Near zero | Pre-orders, print-on-demand |
The division of responsibility: the cart shows good information. The Order Service makes the buy real. For SKUs the catalog flags requires_reservation=true, the cart calls Inventory to place a TTL hold and stores the hold_token on cart_items. The hold releases when the user removes the item, the TTL expires, or checkout converts it to a purchase.
Hold timer trade-offs:
| Timer | Effect |
|---|---|
| 5 min | Fewer ghost holds, more sold-out surprises at checkout |
| 15 min | Standard for limited drops, significant ghost hold problem at high abandonment |
| 30 min | Comfortable for users, blocks real buyers for half an hour |
For most shops: skip holds. Use optimistic. Reserve only at checkout, inside the Order Service, as an atomic step with payment.
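For the flagged SKUs that do need a hold, the mechanics look roughly like this. The HoldStore class is a hypothetical in-memory stand-in for the Inventory Service, not its real API; the injectable clock exists only so the expiry path can be exercised without waiting.

```python
import time
import uuid

class HoldStore:
    """In-memory stand-in for TTL holds in the Inventory Service (hypothetical)."""

    def __init__(self, stock, ttl_s, clock=time.monotonic):
        self.stock = dict(stock)  # sku -> units not currently held
        self.ttl_s = ttl_s
        self.clock = clock
        self.holds = {}           # hold_token -> (sku, qty, expires_at)

    def _expire(self):
        now = self.clock()
        for token, (_sku, _qty, expires_at) in list(self.holds.items()):
            if expires_at <= now:
                self.release(token)

    def place(self, sku, qty):
        """Return a hold_token, or None if not enough unheld stock remains."""
        self._expire()
        if self.stock.get(sku, 0) < qty:
            return None
        self.stock[sku] -= qty
        token = uuid.uuid4().hex
        self.holds[token] = (sku, qty, self.clock() + self.ttl_s)
        return token

    def release(self, token):
        """Give held units back (item removed from cart, or TTL expired)."""
        hold = self.holds.pop(token, None)
        if hold is not None:
            sku, qty, _expires_at = hold
            self.stock[sku] += qty
```

With one pair in stock and a 15-minute TTL, the first shopper gets a token, the second gets None until the hold is released or expires. That None is the ghost-hold cost in the table above: stock that looks empty while it sits in an abandoned cart.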
9. The scaling journey: 10 users to 1 million
flowchart LR
S1["Stage 1<br/>10-100 users<br/>1 Postgres + 1 pod<br/>~$30/mo"]:::s1
S2["Stage 2<br/>~1,000 users<br/>+ merge endpoint<br/>+ catalog split<br/>~$150/mo"]:::s2
S3["Stage 3<br/>10K-100K users<br/>+ Redis · Kafka<br/>+ read replica<br/>~$1-2K/mo"]:::s3
S4["Stage 4<br/>1M users<br/>+ Redis sharding<br/>+ regional deployment<br/>~$10-20K/mo"]:::s4
S1 --> S2 --> S3 --> S4
classDef s1 fill:#e0f2fe,stroke:#0369a1,color:#0c4a6e
classDef s2 fill:#dcfce7,stroke:#15803d,color:#14532d
classDef s3 fill:#fef3c7,stroke:#a16207,color:#713f12
classDef s4 fill:#fce7f3,stroke:#be185d,color:#831843
Stage 1: 10 to 100 users
One Postgres, one app instance. Cart and catalog in the same app. Guest carts use a cart_token cookie. No Redis, no Kafka, no abandonment emails. Inventory is a SELECT on the products table. Ships in three days.
Stage 2: 1,000 users
What breaks: marketing wants abandoned-cart emails. People want phone-to-laptop sync. The catalog deserves its own service.
Split catalog (and eventually inventory) into their own services. Build the merge endpoint and the carts_merged audit table. Add a nightly job for abandoned carts. Still no Redis, still no Kafka. One Postgres read replica handles all reads.
Stage 3: 100,000 users
Several things break at once:
- Cart icon reads (~12/sec) appear in slow query logs.
- Cart page load is slow because each load joins with catalog over HTTP for 5+ items.
- A flash sale on limited stock shows “available” to 5,000 users when 100 pairs remain.
- Inventory has a 30-second blip. Every cart add fails because the service blocks on it.
Fixes in order: Redis as cart cache (write-through, 95%+ hit rate). Inventory check becomes best-effort with fallback to “show available, confirm at checkout.” Reservation only for flagged SKUs. Kafka replaces polling for downstream consumers. Nightly GC deletes expired anonymous carts.
Stage 4: 1 million users
New problems:
- Redis single-node failure means 25,000 empty-looking carts for ten seconds.
- Write contention on item_count update surfaces under load.
- Bots add 50 items per second across thousands of guest carts, hammering inventory.
- EU expansion requires regional data storage.
Shard Redis by hash(user_id) % N, one primary and one replica per shard. Rate-limit add-to-cart per IP and per cart_token. Regional deployment for EU with a local Redis and a Postgres replica. Async checkout: cart emits a frozen snapshot to Kafka, Order Service handles payment and the atomic reserve, user polls for order status.
The cart itself is comfortable at this scale. The bottleneck moves to inventory and checkout.
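The hash(user_id) % N routing from Stage 4 needs a hash that is stable across processes. Python's built-in hash() is randomized per process for strings, so this sketch uses hashlib instead. The function name shard_for and the choice of SHA-256 are assumptions for illustration.

```python
import hashlib

def shard_for(user_id, num_shards):
    """Map a user_id to a shard index in [0, num_shards) deterministically."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    # The first 8 bytes are plenty for an even spread across a handful of shards.
    return int.from_bytes(digest[:8], "big") % num_shards

shard = shard_for(42, 8)
```

Every pod computes the same shard for the same user, so any pod can serve any request. One known cost of plain modulo routing is that changing N remaps most keys, which shows up as a burst of cache misses against Postgres during resharding.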
10. Reliability
Redis dies mid-day. Cart reads fall through to Postgres at ~40 ms instead of ~20 ms. Users see slightly slower pages; nobody loses their cart because Postgres is the truth. On recovery, the first read for each user repopulates Redis. A circuit breaker switches to “DB only” mode after N Redis failures. cart.redis.hit_rate drops to 0. Alert fires.
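The "DB only after N Redis failures" switch can be sketched as a small counter-based breaker. This is a minimal illustration: RedisBreaker, read_cart, and the cache_get / db_get callables are hypothetical names, and the recovery probe that would close the breaker again is left out.

```python
class RedisBreaker:
    """Counts consecutive Redis failures and flips to DB-only mode at a threshold."""

    def __init__(self, max_failures=5):
        self.max_failures = max_failures
        self.failures = 0

    @property
    def db_only(self):
        return self.failures >= self.max_failures

def read_cart(user_id, cache_get, db_get, breaker):
    """Return (cart, source). Postgres is the truth; Redis is only tried first."""
    if not breaker.db_only:
        try:
            cart = cache_get(user_id)
            breaker.failures = 0  # a successful call resets the counter
            if cart is not None:
                return cart, "redis"
        except ConnectionError:
            breaker.failures += 1
    # Cache miss, Redis error, or breaker open: fall through to Postgres.
    return db_get(user_id), "postgres"
```

Once the breaker is open, requests stop paying the Redis timeout on every read, which is what keeps the fallback latency close to a plain Postgres read.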
Postgres primary dies. Standard failover (30-60 seconds). Writes return 503 with Retry-After. Reads continue from replicas. Queued writes retry on recovery.
Inventory service goes down. Cart shows last-known availability or a “confirm at checkout” badge. Cart adds continue. More users hit a sold-out surprise at checkout during the outage. Acceptable.
Checkout starts and payment fails. The Order Service handles it. The cart does NOT clear until it receives a cart.converted event. Payment failure emits cart.checkout_failed. Holds release. User edits and retries.
Race between remove-item and in-flight checkout. Checkout took a frozen snapshot of the cart at the moment it started. The removal hits the live cart but does not affect the in-flight checkout. If checkout succeeds, snapshot items are bought. The live cart (minus purchased items) remains.
11. Observability
| Metric | Why it matters |
|---|---|
| cart.icon_count.p99 | Tightest SLO. Runs on every page. Alert at >40 ms. |
| cart.read.p99 | Cart page load. Alert at >200 ms. |
| cart.write.p99 | Spike means DB contention. |
| cart.redis.hit_rate | Should be >95%. Drop means shard imbalance or repopulation storm. |
| cart.merge.rate | Spike means auth is broken and re-merging on every request. |
| cart.merge.size_trimmed.rate | Non-zero often means the size limit is too low. |
| inventory.check.timeout_rate | Drives the fallback path frequency. Alert at >5%. |
| cart.abandonment.rate | Marketing’s headline metric. Alert on sudden >20% shift. |
| cart.size.p99 | Bot signal if p99 > 50 items. |
| kafka.cart_events.consumer_lag | If this grows, abandonment emails stop arriving. |
| db.replication_lag.p99 | Read replicas must stay under 1 second. |
Page on: cart.icon_count.p99 > 40 ms for 5 min, cart.redis.hit_rate < 80% for 5 min, kafka.cart_events.consumer_lag > 5 min, cart write error rate > 2%.
Ticket on: merge rate spike, size-trimmed rate spike, inventory timeout rate > 5%.
12. Follow-up answers
1. Bots stuffing a cart with 10,000 items.
Hard size limits (100 items, 99 qty per SKU) return 422. Rate-limit POST /cart/items at 30/min per IP and 60/min per logged-in user. WAF rules at the edge for known bot user-agents. Shorter TTL for guest carts from IPs that never load a product page. For determined attackers, push detection upstream: CAPTCHA on suspicious checkout patterns, account-level fraud scoring.
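The 30/min per-IP limit can be sketched as a sliding-window counter. This in-process version is an illustration only: SlidingWindowLimiter is a hypothetical class, and with more than one pod the counters would have to live in the gateway or in shared storage rather than in local memory.

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Allow at most `limit` calls per key within any `window_s`-second window."""

    def __init__(self, limit, window_s, clock=time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        self.hits = defaultdict(deque)  # key -> timestamps of accepted calls

    def allow(self, key):
        now = self.clock()
        recent = self.hits[key]
        # Drop timestamps that have slid out of the window.
        while recent and recent[0] <= now - self.window_s:
            recent.popleft()
        if len(recent) >= self.limit:
            return False  # caller should answer with HTTP 429
        recent.append(now)
        return True
```

A limiter built with limit=30 and window_s=60 and keyed by IP covers the guest case; a second instance with limit=60 keyed by user_id covers logged-in users.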
2. Phone-to-laptop sync delay.
The phone’s add writes to Postgres and Redis immediately. The laptop’s next page load reads the updated state. No push needed. If the cart page is already open on the laptop, it shows stale data until refresh. To make it live: push a cart.item.added event over a WebSocket per user. More infrastructure for a marginal UX win. Most shops skip it.
3. Redis dies mid-day.
Circuit breaker switches to “DB only” mode after N Redis failures. Cart reads fall through to Postgres at ~40 ms. Nobody loses cart contents. On recovery, the first read for each user repopulates Redis via the cache-miss path. The cart.redis.hit_rate metric drops to 0 and alerts fire. Postgres is the truth; do not try to serve stale data from anywhere else.
4. Price went up.
Cart page shows current price ($89) with a note “was $79 when added.” At checkout, if the difference passes the threshold (10% or $5, whichever is smaller), the response includes price_change_acknowledgement_required: true. The UI shows a banner. The user confirms. The second checkout call carries price_change_acknowledged: true. The order record captures both prices. What Alice pays: always current. The snapshot is for display and audit only.
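The threshold check is small enough to write out. This sketch makes three assumptions that the text above does not pin down: the 10% is measured against the snapshot price, only increases trigger the acknowledgement, and "passes" means strictly greater than. It uses integer cents throughout to avoid float rounding.

```python
def price_ack_required(snapshot_cents, current_cents, pct=10, abs_cents=500):
    """True if the price rose by more than min(pct% of snapshot, abs_cents)."""
    threshold_cents = min(snapshot_cents * pct // 100, abs_cents)
    return current_cents - snapshot_cents > threshold_cents

# Alice's shoes: added at $79, now $89.
# Threshold is min($7.90, $5.00) = $5.00, and the $10 increase exceeds it.
needs_ack = price_ack_required(7900, 8900)
```

For a $20 item the 10% side wins instead: the threshold is $2.00, so a rise to $21.50 passes silently while a rise to $22.50 asks for confirmation.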
5. Abandoned cart detection.
Naive scan of all active carts every minute is slow at 100K+ carts. Use a narrow time-window query instead.
Show: the abandonment query
SELECT cart_id, user_id FROM carts
WHERE status = 'active'
  AND user_id IS NOT NULL
  AND updated_at >= NOW() - INTERVAL '6 hours 15 minutes'
  AND updated_at <  NOW() - INTERVAL '6 hours'
  AND NOT EXISTS (
    SELECT 1 FROM cart_abandonment_emails
    WHERE cart_id = carts.cart_id
  );
This touches only carts that just crossed the 6-hour threshold, so the job must run at least every 15 minutes to match the window; a skipped run would miss the carts that aged past it. The partial index on (updated_at) WHERE status = 'active' makes it fast. For each result, emit cart.abandoned to Kafka. The notification service consumes and sends. Record in cart_abandonment_emails to prevent duplicates.
At scale, swap the SQL for a Redis sorted set keyed by updated_at. Pop expired entries every 15 minutes. More efficient, more moving parts.
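A rough sketch of the sorted-set variant follows. The helper functions take any client exposing zadd, zrangebyscore, and zrem with the same argument shapes as redis-py; InMemoryZSet is a tiny stand-in so the example runs without a Redis server. The key name and both helper names are assumptions.

```python
class InMemoryZSet:
    """Stand-in for the three redis-py sorted-set calls used below."""

    def __init__(self):
        self.sets = {}

    def zadd(self, name, mapping):
        self.sets.setdefault(name, {}).update(mapping)

    def zrangebyscore(self, name, min, max):
        members = self.sets.get(name, {})
        hits = [m for m, score in members.items() if min <= score <= max]
        return sorted(hits, key=lambda m: members[m])

    def zrem(self, name, *values):
        for value in values:
            self.sets.get(name, {}).pop(value, None)

KEY = "carts:by_updated_at"   # hypothetical key name
ABANDON_AFTER_S = 6 * 3600

def touch(client, cart_id, now):
    """Call on every cart write: the score is the last-activity timestamp."""
    client.zadd(KEY, {cart_id: now})

def pop_abandoned(client, now):
    """Return and remove every cart idle for at least six hours."""
    due = client.zrangebyscore(KEY, 0, now - ABANDON_AFTER_S)
    if due:
        client.zrem(KEY, *due)
    return due
```

Removing a cart from the set once it is popped prevents a second cart.abandoned event; new activity puts it back through touch. With several workers, the read and the remove would need to be made atomic (for example in a server-side script), which is part of the "more moving parts" cost.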
6. Anonymous carts pile up.
Anonymous carts get expires_at = NOW() + 30 days, refreshed on activity. A nightly GC job deletes expired guest rows. User returns after 90 days with the old cookie: the lookup returns nothing. The cart service issues a new token, sets the cookie, and returns an empty cart. No error shown. If they log in, there is nothing to merge; account cart loads normally.
7. Shared account, simultaneous edits.
Both sessions resolve to the same cart_id via the unique index on (user_id) WHERE status = 'active'. Adds use INSERT ... ON CONFLICT (cart_id, sku) DO UPDATE SET qty = qty + ?. Concurrent adds for the same SKU sum correctly (each is an explicit user action; sum is right here, unlike at merge). Removes use a plain DELETE; first-wins. Both users see each other’s edits on next page load.
8. Currency switch.
Cart stores snapshot_price_cents in the original transaction currency. Catalog returns prices in any requested currency at hydration time. When Alice switches to EUR, displayed prices recompute against the catalog’s current EUR prices. The snapshot stays in the original currency. At checkout, Alice is charged in the displayed currency. The order record captures both currencies for accounting. Never silently change the expected total without showing the user.
9. Item becomes restricted before checkout.
On cart page load, catalog and inventory return availability as restricted_in_region. The cart service displays the item with a “cannot ship to your address” badge. The checkout button disables until the item is removed. If the user somehow reaches checkout: the Order Service re-checks every item against the shipping address. Restricted items appear in the error response. No payment is attempted.
10. Save for later.
This belongs to a Wishlist Service. The cart holds items to buy; the wishlist holds items to remember. The interaction: the UI calls POST /wishlist/items {sku} then DELETE /cart/items/{sku}. Two calls, not atomic. If the wishlist add succeeds and the cart delete fails, the item sits in both (annoying, not broken). The UI retries the cart delete in the background. An alternative is one endpoint POST /cart/items/{sku}/move_to_wishlist that calls both internally. Larger shops keep the services separate.
11. Checkout race (two sessions, same user).
Both sessions share the same cart_id via the unique index. The idempotency key is per checkout attempt. A checkout call reads the cart, writes a checkout_sessions row with a unique (cart_id, session_id) constraint, and emits a frozen snapshot. The second checkout call finds the constraint already occupied and returns a conflict. Only one Order Service call happens. The lock is at the checkout_sessions row, not the cart.
12. Cart grows to 200 items.
Enforce the limit in the application layer, not just in the API response. On POST /cart/items, count current items before inserting. If item_count >= 100, return 422 with a clear error. Keep the limit in the cart service config (not hardcoded) so the business can adjust it. Bot detection (question 1) is the more important line of defense; a 100-item limit alone does not stop a bot adding 1,000 items across 10 carts.
13. Trade-offs worth stating out loud
Cookie vs DB vs Redis. Cookie alone is too small (4 KB) and does not sync across devices. In-memory session does not scale past one server. DB alone gets slow on icon reads at scale. Redis+DB is right once DB reads appear in slow query logs. Start with DB-only. Add Redis when metrics demand it. Never Redis-only: you lose durability.
Optimistic vs reservation. Reservation on every add burns headroom for ghost carts (60-70% abandonment). Optimistic surprises 1-3% of buyers at the last checkout step. Mix: optimistic by default, reservation for explicitly flagged SKUs. That is the senior answer.
Sync vs async checkout. Synchronous checkout is simpler but couples cart latency to payment and fulfillment. Async checkout (frozen snapshot to Kafka, then Order Service) absorbs spikes and isolates failures. Trade-off: a “processing” page instead of instant confirmation. Sync is fine at small scale. Async is required on Black Friday.
Why one cart per user, not many. Some sites offer “birthday cart” or “work cart.” Multiple carts add significant complexity (which is active? merge across them? share with family?). Build it only when customers ask.
Why Postgres and not DynamoDB. ACID for merge and ON CONFLICT add. Analytical queries for abandonment detection. Small data volume. Postgres covers all three. DynamoDB would require a custom transaction layer for merge and a separate scan layer for abandonment.
What to revisit at 10M+ users. Physically shard Postgres by user_id. Move from Redis-as-cache to Redis-as-source-of-truth for active carts with periodic Postgres flushes. Push cart logic to CDN-adjacent workers for sub-50 ms global reads. Pre-aggregate the abandonment funnel in ClickHouse so the cart service is not running analytics queries directly.
14. Common mistakes
“Just store the cart in localStorage.” Misses multi-device sync, bot concerns, and the merge problem at login. Fine for a demo. Loses the design problem.
No merge discussion. Second-most-asked follow-up after inventory. Walk in with a stance: max-qty rule, one serializable transaction, audit row, clear the cookie.
“Reservation on add to cart for everything.” Common junior answer. Then the interviewer asks about 60-70% abandonment, ghost holds, and the design unravels. The right answer is optimistic by default with named exceptions.
Ignoring price drift. “Whatever price is in the cart is what they pay” is wrong and sometimes illegal. Snapshot vs current, surface the difference, require confirmation at threshold.
Checkout lives in the cart service. Checkout is its own service: payment, address check, atomic inventory reserve, order creation, post-purchase events. Cart hands off via a frozen snapshot.
Forgetting the icon read. Every page loads the cart count. That endpoint dominates QPS. Denormalize item_count, cache in Redis, target <20 ms p99.
Treating inventory as a hard dependency on add. If inventory is down, add-to-cart should degrade gracefully, not fail. Show a “confirm at checkout” badge and continue.
Designing for huge write throughput. Even at 1M DAU you see ~20 writes/sec. Do not propose Cassandra because “carts are write-heavy.” They are not.
No audit trail on merge. Without carts_merged, every “my cart is wrong after login” support ticket is unsolvable. Cheap to add. The data is irreplaceable.
The three signals that separate a strong answer from a generic CRUD answer: a confident merge policy (max-qty, serializable, audit), optimistic-by-default inventory with named exceptions, and explicit handling of price drift with acknowledgment at checkout.