Problem #26 · Hard · System Design

Delivery Idle Driver Tracking

Tags: streaming, H3, TTL, geospatial

Scenario: A food delivery company has 30,000 active drivers in a city. The dispatch system needs to know, in near real time, which drivers are currently idle and where they are, so it can offer them the next order. “Idle” means online, not on a trip, and not on a break. Data can’t be more than 10 seconds stale, and dispatch needs to answer “show me all idle drivers within 1.5 km of this restaurant” in well under a second.

In the interview, the question is:

Design how a delivery company knows in near real time which drivers are idle and where they are.

Your Task:

Decide what state to track and where.
Pick a streaming pipeline.
Pick a spatial index for “drivers near point X.”
Cover the failure modes: app crash, no GPS, off-trip-but-not-really.
Sketch how dispatch queries this.

What a good answer covers:

A driver state machine and how it changes.
A last-known-location store with TTL.
A geospatial query (geohash or H3 neighbor lookup).
The stale-driver problem and how to expire safely.
Why you don’t put this in your warehouse.

Try the problem on your own first. Solutions are most valuable after you've struggled with it.

Solution 26: Delivery Idle Driver Tracking

The shape of the system

flowchart TB
    DA([Driver app<br/>ping every 5s])
    TR([Trip system<br/>trip events])
    KP([Kafka: driver_ping])
    KT([Kafka: trip_status])
    FL([Flink stream processor<br/>keyed by driver_id<br/>state: online / on_trip / break / offline<br/>emits idle drivers keyed by H3])
    STORE([Live driver store<br/>Redis / DragonflyDB<br/>per-driver hash with TTL 60s<br/>H3 index to driver set])
    DISP([Dispatch service<br/>query: idle drivers within 1.5 km<br/>resolves to hex + neighbors])

    DA --> KP --> FL
    TR --> KT --> FL
    FL --> STORE --> DISP

    style DA fill:#dcfce7,stroke:#15803d,color:#14532d
    style TR fill:#dcfce7,stroke:#15803d,color:#14532d
    style KP fill:#fed7aa,stroke:#c2410c,color:#7c2d12
    style KT fill:#fed7aa,stroke:#c2410c,color:#7c2d12
    style FL fill:#dbeafe,stroke:#1e40af,color:#1e3a8a
    style STORE fill:#dbeafe,stroke:#1e40af,color:#1e3a8a
    style DISP fill:#fed7aa,stroke:#c2410c,color:#7c2d12

The driver state machine

Drivers move between four states:

stateDiagram-v2
    [*] --> offline
    offline --> online: login
    online --> offline: logout
    online --> on_trip: start_trip
    on_trip --> online: end_trip
    online --> break: break_start
    break --> online: break_end

Only online is idle. The stream processor maintains this state per driver from two streams:

driver_ping: sent every 5 seconds, carrying the location and the app's claimed status (online or break)
trip_status: trip_started and trip_completed, which drive the start_trip and end_trip transitions above

The processor’s keyed state per driver holds: state, last_lat, last_lng, last_ping_at. Each event updates the state.
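The transitions above fit in a small lookup table. The following is a minimal sketch of the per-driver keyed state, not the Flink job itself. The `DriverState` class and its method names are illustrative, and silently ignoring an event that is not valid in the current state is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

# (current_state, event) -> next_state, copied from the state diagram.
TRANSITIONS = {
    ("offline", "login"): "online",
    ("online", "logout"): "offline",
    ("online", "start_trip"): "on_trip",
    ("on_trip", "end_trip"): "online",
    ("online", "break_start"): "break",
    ("break", "break_end"): "online",
}


@dataclass
class DriverState:
    """Keyed state held per driver_id (hypothetical class name)."""
    state: str = "offline"
    last_lat: Optional[float] = None
    last_lng: Optional[float] = None
    last_ping_at: Optional[float] = None

    def apply_event(self, event: str) -> None:
        # Assumption: an event that is not valid in the current state is ignored.
        self.state = TRANSITIONS.get((self.state, event), self.state)

    def apply_ping(self, lat: float, lng: float, ts: float) -> None:
        # A ping only refreshes location and timestamp; it never changes state here.
        self.last_lat, self.last_lng, self.last_ping_at = lat, lng, ts

    @property
    def is_idle(self) -> bool:
        # Only online counts as idle.
        return self.state == "online"
```

A driver who logs in is idle; once start_trip arrives, a stray break_start is ignored until end_trip returns them to online.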

Why streaming, not a database

You could have the app write directly to Postgres. Don’t.

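30,000 drivers at one write every 5 seconds is 6,000 writes/sec, nearly all of them updates to the same 30,000 hot rows.
Trips happen continuously, and every state change adds more writes on top of the pings.
Row locks, WAL writes, and dead-tuple cleanup cost throughput for data that is obsolete 5 seconds later.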

A stream plus an in-memory store handles this comfortably.

The location index

For “drivers within 1.5 km of point X,” we need a spatial index. Two practical choices:

H3 hex grid (recommended). Resolution 9 (~0.10 km² hexes) or 8 (~0.74 km²). Each driver’s location maps to a hex. A query for “within 1.5 km” expands to “this hex plus its neighbors out to N rings.”

  
import h3  # h3-py v3 names; v4 calls these latlng_to_cell and grid_disk
center = h3.geo_to_h3(lat, lng, resolution=9)
nearby = h3.k_ring(center, k)   # this hex plus rings 1..k; choose k to cover the radius

Then we read all drivers whose stored hex is in nearby and filter by exact distance.

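Geohash. Same idea with rectangular cells. Works fine at moderate scale. H3 wins on neighbor math: a hexagon's six neighbors all sit at about the same center-to-center distance, while a geohash cell's eight neighbors include diagonals that are farther away.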
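How many rings N a radius needs can be estimated from the hex size. The sketch below assumes regular, flat hexagons, which H3 cells only approximate, and uses the average areas quoted above. Both function names are illustrative. The bound is conservative: it covers a query point anywhere inside the center hex, including at a corner.

```python
import math


def hex_edge_km(area_km2: float) -> float:
    # Regular hexagon: area = (3 * sqrt(3) / 2) * edge^2
    return math.sqrt(2.0 * area_km2 / (3.0 * math.sqrt(3.0)))


def covered_radius_km(k: int, area_km2: float) -> float:
    """Radius guaranteed to be covered by k rings, for any point in the center hex.

    The k-ring disk reaches at least (1.5*k + 0.5) edges from the center hex's
    center, and the query point can sit up to one edge away from that center.
    """
    return max(0.0, (1.5 * k - 0.5) * hex_edge_km(area_km2))


def rings_for_radius(radius_km: float, area_km2: float) -> int:
    """Smallest k whose guaranteed coverage reaches radius_km."""
    edge = hex_edge_km(area_km2)
    return max(0, math.ceil((radius_km / edge + 0.5) / 1.5))


if __name__ == "__main__":
    print(rings_for_radius(1.5, 0.10))   # resolution 9
    print(rings_for_radius(1.5, 0.74))   # resolution 8
```

Under these assumptions a 1.5 km radius needs about 6 rings at resolution 9 or 3 rings at resolution 8. The exact-distance filter then trims the extra candidates.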

The store layout

In Redis terms:

# per-driver state
HSET driver:42 state online lat 1.3245 lng 103.8512 h3 89...3fff last_ping 1715680000
EXPIRE driver:42 60

# per-hex index of idle drivers
SADD idle_drivers:h3:89...3fff 42 91 132
EXPIRE idle_drivers:h3:89...3fff 60

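The TTL is the key safety net. If a driver disappears (app crash, dead battery), their driver:<id> hash expires within 60 seconds of the last ping, with no separate cleanup job. Redis expires whole keys, not individual set members, so a stale ID can linger in a hex set that other drivers keep refreshing. The hydration step in the query path drops any ID whose hash is gone.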

When the stream processor updates a driver:

Move them out of the old hex set if they moved.
Add them to the new hex set if they’re still idle.
Refresh the per-driver hash.

If the driver starts a trip, remove them from any hex set. They’re no longer idle.
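The update rules above can be sketched with plain dictionaries standing in for Redis. This is an in-memory illustration only: the `IdleIndex` class is hypothetical, the hex IDs are placeholder strings, and TTL expiry is not modeled.

```python
from collections import defaultdict


class IdleIndex:
    """In-memory stand-in for the driver:<id> hashes and idle_drivers:h3:<hex> sets."""

    def __init__(self):
        self.drivers = {}                  # driver_id -> fields of the per-driver hash
        self.hex_sets = defaultdict(set)   # hex id -> set of idle driver ids

    def update(self, driver_id, state, lat, lng, hex_id, ts):
        old = self.drivers.get(driver_id)
        # Leave the previous hex set first. This covers both "moved to another
        # hex" and "stopped being idle" (for example, a trip started).
        if old is not None:
            self.hex_sets[old["h3"]].discard(driver_id)
        # Only idle (online) drivers are indexed by hex.
        if state == "online":
            self.hex_sets[hex_id].add(driver_id)
        # Refresh the per-driver record on every update.
        self.drivers[driver_id] = {
            "state": state, "lat": lat, "lng": lng, "h3": hex_id, "last_ping": ts,
        }
```

Against real Redis, the same steps would be an SREM, an SADD, and an HSET plus EXPIRE, ideally sent together in one pipeline or transaction so a query never sees the driver in two hexes.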

Query path for dispatch

When a new order comes in at the restaurant location:

Compute the restaurant’s H3 hex.
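Get the k_ring of hexes covering the search radius (with resolution 9 edges of about 0.2 km, k=3 guarantees only about 0.8 km; covering the full 1.5 km takes about k=6, or k=3 at resolution 8).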
Read the idle_drivers:h3:* set for each hex (pipelined SMEMBERS or one SUNION; MGET only reads string keys).
Hydrate each driver’s lat/lng from driver:<id> hashes.
Compute exact distance, filter by 1.5 km, sort by distance.
Return the top N.

Total latency: usually 5-20 ms.
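The steps above, from set union through exact-distance filtering, can be sketched over in-memory dictionaries. This is an illustration, not the production query: `idle_drivers_near` and its parameters are hypothetical, the caller supplies the list of nearby hex IDs, and haversine on a spherical Earth is one reasonable choice for the exact distance.

```python
import math


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km on a spherical Earth (mean radius 6371.0088 km)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lng2 - lng1)
    h = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0088 * math.asin(math.sqrt(h))


def idle_drivers_near(lat, lng, nearby_hexes, hex_sets, drivers, radius_km=1.5, top_n=10):
    # Union the idle sets of every hex in the ring.
    candidates = set()
    for hex_id in nearby_hexes:
        candidates |= hex_sets.get(hex_id, set())

    hits = []
    for driver_id in candidates:
        record = drivers.get(driver_id)
        # A missing record means the per-driver entry expired: drop the stale ID.
        if record is None or record["state"] != "online":
            continue
        dist = haversine_km(lat, lng, record["lat"], record["lng"])
        if dist <= radius_km:
            hits.append((dist, driver_id))

    hits.sort()  # nearest first
    return [(driver_id, dist) for dist, driver_id in hits[:top_n]]
```

The hex lookup only narrows the candidate set. The haversine check is what enforces the 1.5 km radius.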

Handling failure modes

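App crashes. Pings stop. Within 60 seconds of the last ping the TTL fires and the driver disappears from queries. Until then the record is stale, so to honor the 10-second freshness requirement dispatch can also skip any driver whose last_ping is more than 10 seconds old.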

No GPS / bad GPS. Pings still arrive with a quality flag. The stream processor flags poor-quality positions and either holds the last good location for up to 30 seconds or drops the driver out of idle.
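The hold-or-drop rule could look like the sketch below. The ping fields (`gps_ok`, `lat`, `lng`) and the shape of `last_good` are assumptions made for illustration; the 30-second hold window comes from the text.

```python
HOLD_SECONDS = 30  # how long a last good fix may stand in for a bad one


def resolve_position(ping, last_good, now):
    """Return the (lat, lng) to index, or None to drop the driver out of idle.

    ping:      dict with hypothetical fields gps_ok, lat, lng
    last_good: dict with lat, lng, ts of the last good fix, or None
    now:       current time in seconds
    """
    if ping["gps_ok"]:
        return (ping["lat"], ping["lng"])
    # Poor-quality fix: fall back to the last good location while it is fresh.
    if last_good is not None and now - last_good["ts"] <= HOLD_SECONDS:
        return (last_good["lat"], last_good["lng"])
    # No usable location: the driver should not be offered orders.
    return None
```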

Driver says “online” but they’re actually on a trip. The trip system is the source of truth for on_trip. The processor uses trip events to override the app’s claim. If a trip event says trip_started, we set the driver to on_trip regardless of what the ping says.
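As a sketch, the override is a single precedence rule. The function name is illustrative, and treating any unrecognized app claim as offline is an assumption.

```python
def effective_state(app_claim, trip_active):
    """Combine the app's claimed status with the trip system's view.

    app_claim:   status carried by the ping, expected to be "online" or "break"
    trip_active: True between trip_started and trip_completed for this driver
    """
    # The trip system is the source of truth: an active trip always wins.
    if trip_active:
        return "on_trip"
    # Assumption: anything other than the two known claims is treated as offline.
    return app_claim if app_claim in ("online", "break") else "offline"
```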

Network split between app and dispatch. The driver shows online for the duration of the TTL, then disappears. Dispatch sees a smaller pool and offers fewer matches, which is the right failure mode.

Stream processor crash. Flink restarts from its last checkpoint, restores the keyed state, and replays Kafka from the checkpointed offsets. The store may briefly hold stale records, but the TTL bounds how long.

Why not just query the warehouse?

Latency. Warehouse queries take seconds; dispatch needs milliseconds.
Cost. Per-query cost adds up at hundreds of dispatch queries per second.
Concurrency. Warehouses are not built for the read pattern.

The warehouse is still in the picture for analytics (“how many idle drivers did we have at 6 PM yesterday?”), fed from the same Kafka topics.

Capacity estimate

30,000 drivers × 1 ping/5 sec = 6,000 writes/sec. Read side: dispatch queries during peak meals, maybe 200 orders/sec, each doing a small read. Trivial. Store size: 30,000 entries × ~150 bytes = ~5 MB. Plus a hex index over maybe 5,000 hexes. Tiny.

All this fits easily on a single Redis instance. For a 100x larger company the same shape holds; you just shard by city or driver_id.
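The arithmetic behind these estimates, using only the numbers stated above (the variable names are illustrative):

```python
drivers = 30_000
ping_interval_s = 5
bytes_per_entry = 150      # rough size of one per-driver record
scale_factor = 100         # the "100x larger company" case

writes_per_sec = drivers / ping_interval_s          # 6,000 writes/sec
store_mb = drivers * bytes_per_entry / 1_000_000    # 4.5 MB, rounded to ~5 MB above

# At 100x scale, before any sharding:
writes_per_sec_100x = writes_per_sec * scale_factor  # 600,000 writes/sec
store_mb_100x = store_mb * scale_factor              # 450 MB

print(writes_per_sec, store_mb, writes_per_sec_100x, store_mb_100x)
```

Even at 100x the data still fits in memory on one machine. The write rate, not the storage size, is what motivates sharding.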

What about long-running idle drivers parking somewhere?

Some drivers park and wait. They keep sending pings but never move. The system correctly counts them as idle. The dispatch service may want to prefer drivers who have been idle longer (fairness) or shorter (faster pickup). Either is a small change: have the processor record an idle_since timestamp when a driver enters online, return it alongside last_ping in the query response, and let dispatch's algorithm pick.

Common mistakes interviewers want you to name

Storing driver state in the OLTP database. Hot row contention, slow reads.
No TTL. A crashed app means the driver is “online forever” in the store.
Reading from the warehouse. Latency and cost both blow up.
Trusting the app’s state claim. Trip events must override.
Spatial query that scans all drivers. Cell-based index is essential.

Bonus follow-up the interviewer might throw

“How would you change this if you wanted to predict idle drivers, not just track them?”

Same data flows, plus the warehouse historical view. A model can learn that drivers in hex X tend to be idle at 2 PM on weekdays. Dispatch can pre-warm offers there. The live system stays the same; the prediction is just an extra hint, never the source of truth.