Problem #22 Hard System Design

Banking App Monthly Spending Widget

streamingCDCserving storelow latency

Scenario: A retail bank wants a “your spending this month” widget on the home screen of its mobile app. When the user opens the app, they should see their total spend so far this month, broken down into categories like groceries, transport, dining. It needs to feel instant. The bank has 5 million active customers and millions of transactions per day.

In the interview, the question is:

Sketch what sits behind a “your monthly spending” widget in a banking app, where users expect it to feel instant.

Your Task:

Decide what “feels instant” actually means and design to that.
Pick a storage choice for the read path and explain why.
Show how the data flows from a card swipe to the widget.
Cover the merchant-to-category problem.
Mention how you would handle refunds and corrections.

What a Good Answer Covers:

Read-optimized serving store, not the warehouse.
Incremental updates per transaction (CDC or stream).
Pre-aggregated monthly totals per (user, category).
Idempotency and refund handling.
Cold cache vs warm cache, and the first-of-the-month problem.

Try the problem on your own first. Solutions are most valuable after you've struggled with it.

What “instant” means

“Instant” on a phone means the widget renders in under 200 ms after the screen opens. The data fetch itself has maybe 80-100 ms of that budget after we account for TLS, app rendering, and the rest. So the backend has to answer in roughly 50 ms at p95.

That number drives every design choice. A warehouse query takes seconds. A relational database can do it. A key-value store does it easily. So we are not going to query the warehouse from the phone, ever. We are going to maintain a small pre-computed answer in a fast store.

The shape of the system

flowchart TB
    SW([Card swipe<br/>or digital payment])
    CORE([Core banking transactions DB<br/>source of truth, OLTP])
    CDC([CDC stream<br/>Debezium reading the WAL])
    K([Kafka topic: transactions])
    SP([Stream processor<br/>Flink or Kafka Streams<br/>look up category, apply sign,<br/>key by user, month, category])
    SS([Fast serving store<br/>DynamoDB, Bigtable, Redis<br/>key: user_id | YYYY-MM])
    APP([Mobile app widget<br/>point read under 50 ms])

    WH([S3 then warehouse<br/>BigQuery or Snowflake<br/>for dispute and reporting])

    SW --> CORE --> CDC --> K --> SP --> SS --> APP
    K -. parallel drain .-> WH

    style SW fill:#dcfce7,stroke:#15803d,color:#14532d
    style CORE fill:#fef3c7,stroke:#a16207,color:#713f12
    style CDC fill:#fed7aa,stroke:#c2410c,color:#7c2d12
    style K fill:#fed7aa,stroke:#c2410c,color:#7c2d12
    style SP fill:#dbeafe,stroke:#1e40af,color:#1e3a8a
    style SS fill:#dbeafe,stroke:#1e40af,color:#1e3a8a
    style APP fill:#dbeafe,stroke:#1e40af,color:#1e3a8a
    style WH fill:#fef3c7,stroke:#a16207,color:#713f12

Data shape in the serving store

Key:    user_id | YYYY-MM
Value:  {
  total:      4321.50,
  categories: {
    groceries: 850.20,
    transport: 130.00,
    dining:    295.75,
    bills:    1820.00,
    other:    1225.55
  },
  txn_count:    78,
  updated_at:   "2025-05-14T08:21:12Z",
  last_seq:     991283
}

One record per user per month. Reading it is a single point lookup. ~5 KB per user. 5 million users × 12 months would be a few hundred GB at most. Cheap.

How a transaction flows

Customer pays at a merchant. The core banking system writes the transaction row. This is the moment of truth.
CDC (Debezium reading the Postgres WAL, or its mainframe equivalent) emits the row to Kafka within a second or two.
A stream processor reads the Kafka topic. For each transaction:
- Look up the merchant category from a small in-memory map (categories rarely change, refreshed every hour).
- Compute the signed amount: positive for purchases, negative for refunds.
- Update the (user, year-month, category) total.
The new total is written to the fast serving store.
Next time the user opens the app, the widget reads the row in one call.

End to end latency from swipe to widget refresh: typically 2-5 seconds. The user does not see latency because they were not staring at the app at the moment of swipe.

The merchant-to-category problem

The biggest soft problem in this pipeline. A swipe at “TESCO SG #421” should become “groceries.” A swipe at “GRAB *XR23F” should become “transport.” This mapping is messy and changes over time.

How I would handle it:

A merchant_categories table keyed by merchant id, maintained by a small team. Loaded into the stream processor as a broadcast state, refreshed hourly.
A fallback rule engine for unknown merchants: pattern match on the descriptor string (“UBER” → transport, “STARBUCKS” → dining).
An “other” bucket for genuinely unmapped transactions. The widget shows them but they get reviewed weekly.

When a category changes (the mapping is fixed), I would NOT retroactively change past months. The widget shows what was said at the time. Otherwise the user sees their March number change in May, which feels wrong.

Handling refunds and corrections

Refunds are just negative transactions. The stream processor adds them and the totals go down. Important: if the original purchase happened in April and the refund happens in May, which month does the refund belong to?

Two conventions, pick one:

Banking convention (post date): the refund hits whatever month it cleared in. So the May refund reduces the May total.
Spending convention (transaction date): the refund reduces the month the original purchase was in. So the May refund reduces the April total retroactively.

Banks usually go with post date, because that’s how the statement reads. The widget should match the statement.

Cold start and the first-of-the-month problem

On the first of the month, every user has a fresh empty record. The very first read after midnight needs to materialize the empty row. Two ways:

Lazy: the widget service checks the store, finds nothing, returns zeros, and the row gets created on the first transaction. Simple but produces a brief “your spending: 0” until the first txn arrives, which is correct anyway.
Eager: a scheduled job at midnight of the user’s local time zone creates an empty record. Slightly nicer UX but more code.

Lazy is fine for a v1.

Idempotency

CDC can deliver the same transaction twice (Kafka retries, processor restarts). The stream processor must dedupe by transaction_id. Two patterns:

Keep the last last_seq processed in the same record. Skip anything ≤ that.
Use Flink exactly-once with checkpointing.

Either way, the rule is: applying the same transaction twice produces the same total once. This is the same idempotency rule from Problem 9, just inside a streaming processor.

Why not query the warehouse directly?

Warehouse query latency is seconds, not milliseconds.
Warehouse is not built for thousands of concurrent point reads.
Cost scales with reads. Each widget render would cost real money at 5 million daily active users.

The warehouse is still part of the picture, but for analytics, disputes, and historical reports, not for the live widget.

What the warehouse layer does

The same Kafka topic drains to S3 (Firehose) and then to BigQuery or Snowflake. The warehouse hosts:

Full transaction history (years).
Analytics: “average spend per customer per category.”
Risk and fraud models.
Customer support: “show me this user’s full statement.”

The widget never touches it.

Common mistakes interviewers want you to name

Pointing the widget at the warehouse. Lots of teams try this. Latency, cost, and concurrency all bite.
Recomputing the monthly total on every read (“aggregate the last 30 days from a transactions cache”). Even on a fast store, this is expensive at the first-of-month-just-after-payday spike.
Forgetting to dedupe. CDC retries are normal. A doubled total is a customer complaint.
Mapping changes that rewrite history. Surprising the user is worse than a bad mapping.
No fallback for when the serving store is down. If the widget is critical, fall back to a stale cached number, not a spinner forever.

Bonus follow-up the interviewer might throw

“How would you make this work offline, when the phone has no network?”

Cache the last successful response on the phone with a short TTL (an hour). Show it with a small “as of 8:21” timestamp. The user sees their last known number instead of an error, and the freshness label is honest. When the phone reconnects, refresh in the background.