Design a Help Desk Ticketing System (like a basic Zendesk)
What we are building
A customer emails support@acme.com: “My order is wrong, I got the wrong size.” The email arrives, the system creates a ticket, routes it to the Orders queue, and an agent picks it up. The agent replies, the customer writes back, and the conversation continues on the same thread until the issue is resolved. A timer runs the whole time, tracking whether the team responded within the one-hour window they promised.
That is Zendesk. It sounds like a to-do list with a status column. The interesting problems are hiding inside:
- Conversation threading. Is this incoming email a new problem, or a reply to a ticket from last week?
- Routing rules. Which team gets this ticket? Which agent within that team?
- SLA timers. The timer started Friday at 5 PM. Does it count Saturday and Sunday?
- Agent assignment races. Two agents click “Next Ticket” at the same instant. Who gets it?
- Multi-channel ingest. Email, web form, chat, Slack. Different wire formats, same internal model.
We will start with a 3-agent startup on one database, then scale to 2,000 agents across four regions. At each step, we name what just broke and add the smallest fix.
The lifecycle of one ticket
Every ticket travels through a small set of states. Picture it before drawing any boxes.
stateDiagram-v2
direction LR
[*] --> New: customer sends message
New --> Assigned: agent is picked
Assigned --> InProgress: agent starts work
InProgress --> WaitingOnCustomer: needs info (SLA pauses)
WaitingOnCustomer --> InProgress: customer replies (SLA resumes)
InProgress --> Resolved: agent marks fixed
Resolved --> Closed: confirmed or 72h auto-close
Closed --> Reopened: customer replies within 7 days
Reopened --> InProgress
New --> Spam
New --> Duplicate
A ticket spends most of its life cycling between InProgress and WaitingOnCustomer. Everything else (SLA, routing, caching, audit) exists to support that loop correctly.
Take this with you. A help desk is a state machine running on many conversations at once. The interesting problems are not about throughput. They are about correctness at every transition.
How big this gets
Same product, two very different companies.
| Company | Agents | Tickets/day | Writes/sec (peak) | Open at once |
|---|---|---|---|---|
| Startup | 3 | 50 | negligible | ~100 |
| Enterprise | 2,000 | 50,000 | ~15 | ~150,000 |
Show: how the numbers come out
Assume each ticket takes an average of 3 days to resolve, and 8 messages total (customer ask, agent reply, follow-ups, close).
Startup (50 tickets/day)
- 50 / 86,400 = ~0.0006 tickets/sec. Nearly nothing.
- 8 messages each = 400 messages/day.
- 50 tickets/day x 3-day average = ~100 open at any moment.
- 5 years of data: about 1 GB total. Fits anywhere.
Enterprise (50,000 tickets/day)
- 50,000 / 86,400 = ~0.6 tickets/sec average, ~3 at peak.
- 8 messages each = 400,000 messages/day, ~5/sec average, ~15 at peak.
- 50,000/day x 3-day average = ~150,000 open at any moment.
- 2,000 agents refreshing dashboards every 30 seconds = ~65 reads/sec.
- 5 years: ~5 TB of text, ~50 TB of attachments in S3.
What the math tells you. This is not a high-throughput system. Even at enterprise scale, you write about 15 things per second. A single Postgres handles that. The design pressure is on read latency (65 agent dashboard reads/sec) and correctness (no lost tickets, no double-assigns).
Reads beat writes about 10 to 1. The read path matters more than the write path.
The smallest version that works
Forget enterprise. Three agents, one channel: email. Tickets go to whoever is next in the rotation. No SLA timer yet.
flowchart LR
C([Alice]):::user --> API["Ticket Service"]:::app
API --> DB[("Postgres<br/>2 tables")]:::db
DB -.email.-> B([Bob]):::user
B --> API
classDef user fill:#dbeafe,stroke:#1e40af,color:#1e3a8a
classDef app fill:#dcfce7,stroke:#15803d,color:#14532d
classDef db fill:#fed7aa,stroke:#c2410c,color:#7c2d12
Three endpoints carry the whole product.
| Endpoint | What it does |
|---|---|
POST /tickets | Create a new ticket from an intake event |
POST /tickets/{id}/messages | Add a reply (public) or internal note |
POST /tickets/{id}/status | Transition state (assign, resolve, close, reopen) |
Show: the two core tables
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
CREATE TABLE tickets (
ticket_id UUID PRIMARY KEY,
public_ref TEXT NOT NULL UNIQUE, -- "TKT-1234" shown to customer
subject TEXT NOT NULL,
customer_email TEXT NOT NULL,
assignee_id TEXT,
team_id TEXT,
status TEXT NOT NULL DEFAULT 'new',
priority TEXT NOT NULL DEFAULT 'medium',
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
first_response_at TIMESTAMPTZ,
resolved_at TIMESTAMPTZ
);
CREATE TABLE ticket_messages (
message_id UUID PRIMARY KEY,
ticket_id UUID NOT NULL REFERENCES tickets(ticket_id),
author_type TEXT NOT NULL, -- 'customer' | 'agent' | 'system'
author_id TEXT,
visibility TEXT NOT NULL DEFAULT 'public',
body TEXT NOT NULL,
inbound_message_id TEXT, -- email Message-ID header
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE UNIQUE INDEX ON ticket_messages (inbound_message_id)
WHERE inbound_message_id IS NOT NULL;
The unique index on inbound_message_id is load-bearing: when the email adapter crashes and reprocesses the same message, the duplicate insert fails silently. No duplicate messages on the ticket.
Take this with you. Start with the smallest thing that works. The interview is really about what you add next, and why.
Decision 1: how do we thread email replies?
The first crack shows up fast. A customer replies to a closed ticket from two weeks ago. The adapter has no idea this is a reply. It creates a new ticket. The customer is furious because the new agent has no context.
Threading an inbound email into the correct ticket requires checking multiple signals in order of trust.
flowchart TD
A[raw email arrives] --> B{In-Reply-To matches<br/>a saved Message-ID?}
B -- yes --> D[append to existing ticket]:::ok
B -- no --> E{subject contains TKT-NNNN?}
E -- yes --> D
E -- no --> F{heuristic match?<br/>same sender + similar subject<br/>within 7 days}
F -- yes --> G[flag for agent review]:::ok
F -- no --> H[create new ticket]:::ok
classDef ok fill:#dcfce7,stroke:#15803d,color:#14532d
The three layers and what they catch:
| Layer | Signal | Catch rate | Risk |
|---|---|---|---|
| RFC 5322 headers | In-Reply-To / References | ~85% | Low, headers are reliable |
| Subject tag | [TKT-1234] in subject | ~10% | Low, tag is injected by us |
| Heuristic | Same sender, similar subject | ~3% | Medium, can merge unrelated tickets |
The heuristic layer is off by default. Turn it on only after the first two cover the easy cases.
When the adapter sends an outbound message, it saves the outgoing Message-ID. Inbound replies carry In-Reply-To: <that-message-id>. A single index lookup on ticket_messages.inbound_message_id finds the match.
Take this with you. Email threading by header is the single hardest part of a help desk. Build it in three layers: header match first, subject tag second, heuristics never on by default.
Decision 2: how do we pick an agent?
A ticket arrives. 200 agents are logged in. Who gets it?
flowchart LR
subgraph A["Option A: round robin"]
A1["Walk a pointer<br/>through agent list"]:::app
A2["Problem: ignores skill<br/>and current load"]:::bad
end
subgraph B["Option B: skill-based push"]
B1["Tag ticket at intake<br/>match to agent skill"]:::app
B2["Problem: silently misroutes<br/>when skill tags are stale"]:::bad
end
subgraph C["Option C: pull queue"]
C1["Ticket sits in queue<br/>agent clicks Next"]:::app
C2["Problem: agents cherry-pick<br/>easy tickets"]:::bad
end
subgraph D["Option D: load-balanced push"]
D1["Assign to agent with<br/>fewest open tickets"]:::app
D2["Problem: ignores ticket<br/>complexity"]:::bad
end
classDef app fill:#dcfce7,stroke:#15803d,color:#14532d
classDef bad fill:#fecaca,stroke:#b91c1c,color:#7f1d1d
In practice, most systems combine B and D: skill-based to narrow the candidate pool, load-balanced to pick within that pool, pull queue as fallback when push fails.
The race to solve: two agents click “Next Ticket” at the same instant. Both queries return TKT-100. Both think it is theirs.
Fix: SELECT ... FOR UPDATE SKIP LOCKED in Postgres. Agent A locks TKT-100. Agent B’s query sees the lock, skips it, and gets TKT-101. Both succeed. No application-level coordination needed.
Show: pull-queue assignment with SKIP LOCKED
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
def claim_next_ticket(agent):
with db.transaction():
ticket = db.query("""
SELECT ticket_id FROM tickets
WHERE team_id = ? AND assignee_id IS NULL
AND status IN ('new', 'assigned')
ORDER BY priority DESC, created_at ASC
FOR UPDATE SKIP LOCKED
LIMIT 1
""", agent.team_id)
if not ticket:
return None
db.update("tickets", ticket.id,
assignee_id=agent.id, status='assigned')
db.insert("assignments",
ticket_id=ticket.id, agent_id=agent.id, reason='pulled')
return ticket
Take this with you.
FOR UPDATE SKIP LOCKEDis the standard Postgres pattern for work queues. One query, no locks at the application layer, no double-assigns.
Decision 3: how do SLA timers work?
The CEO asks: “We promised enterprise customers a 1-hour first reply. How do we track that?”
Two things make this much harder than they look.
Problem 1: business hours. A ticket arrives Friday at 5 PM. The 1-hour SLA must pause at close of business and resume Monday at 9 AM. Deadlines are not created_at + 1 hour. They are created_at + 1 business hour, which requires walking forward through the team’s schedule.
Problem 2: clock pauses. When an agent moves a ticket to WaitingOnCustomer, the clock must pause. If it keeps running, every ticket that needs customer input eventually breaches through no fault of the agent.
flowchart TD
A[SLA worker wakes every 30s] --> B{query: tickets<br/>near deadline}
B --> C{clock paused?}
C -- yes --> D[skip]:::ok
C -- no --> E{how close?}
E -- 30 min away --> F[warn team lead]:::ok
E -- at deadline --> G[mark breached]:::bad
G --> H[raise priority]
G --> I[reassign to senior agent]
G --> J[page manager]
classDef ok fill:#dcfce7,stroke:#15803d,color:#14532d
classDef bad fill:#fecaca,stroke:#b91c1c,color:#7f1d1d
The SLA table is narrow by design. The worker scans it every 30 seconds with a tight index. Keeping SLA columns out of the wide tickets row makes the sweep fast.
Show: SLA table, business-hour math, and the sweep worker
1
2
3
4
5
6
7
8
9
10
11
12
13
14
CREATE TABLE sla_targets (
ticket_id UUID PRIMARY KEY REFERENCES tickets(ticket_id),
priority TEXT NOT NULL,
first_response_due TIMESTAMPTZ,
resolution_due TIMESTAMPTZ,
paused_at TIMESTAMPTZ,
pause_reason TEXT, -- 'waiting_on_customer' | 'out_of_hours'
business_hours_id TEXT NOT NULL,
first_response_at TIMESTAMPTZ,
resolved_at TIMESTAMPTZ,
breach_state TEXT NOT NULL DEFAULT 'on_track'
);
CREATE INDEX idx_sla_first ON sla_targets (first_response_due)
WHERE first_response_at IS NULL AND paused_at IS NULL;
Business-hour-aware deadline computation:
1
2
3
4
5
6
7
8
9
10
11
12
def add_business_time(start, duration, schedule):
cursor = start
remaining = duration
while remaining > 0:
if not schedule.is_business_time(cursor):
cursor = schedule.next_window_start(cursor)
continue
window_end = schedule.current_window_end(cursor)
chunk = min(remaining, window_end - cursor)
cursor += chunk
remaining -= chunk
return cursor
The sweep worker runs every 30 seconds:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
def sla_sweep():
rows = db.query("""
SELECT ticket_id, first_response_due
FROM sla_targets
WHERE first_response_at IS NULL
AND paused_at IS NULL
AND first_response_due < NOW() + interval '30 minutes'
AND breach_state != 'breached'
""")
for row in rows:
if row.first_response_due < now():
emit("sla.breached", row.ticket_id)
else:
emit("sla.warning", row.ticket_id)
Pause when ticket moves to WaitingOnCustomer:
1
2
3
4
5
6
7
8
9
10
11
12
def pause_sla(ticket_id, reason):
db.update("sla_targets", ticket_id, paused_at=now(), pause_reason=reason)
def resume_sla(ticket_id):
target = db.get("sla_targets", ticket_id)
if target.paused_at is None:
return
paused_for = now() - target.paused_at
db.update("sla_targets", ticket_id,
paused_at=None,
first_response_due=target.first_response_due + paused_for,
resolution_due=target.resolution_due + paused_for)
Take this with you. SLA timers are not
created_at + N hours. They require business-hour-aware math, a pause/resume mechanism, and a worker that sweeps regularly. Get any one wrong and the metric becomes untrustworthy.
Decision 4: how do we keep notifications off the write path?
When a ticket is assigned, Bob gets a Slack DM and an email. If Slack is down, should the ticket creation fail? No.
The answer is to separate event emission from event delivery. The Ticket Service writes to Postgres and emits events to Kafka. A Notification Service consumes those events asynchronously and fans out to Slack, email, and push.
flowchart LR
E["Ticket Service"]:::app -->|INSERT + COMMIT| DB[("Postgres")]:::db
DB -->|CDC / outbox| K{{"Kafka<br/>ticket.created<br/>ticket.assigned<br/>sla.breached"}}:::queue
K --> N["Notification Service"]:::app
K --> IX["Search Indexer"]:::app
K --> Cold[("Audit cold tier<br/>S3 · 7yr")]:::db
N --> Slack["Slack DM"]:::ext
N --> Email["Email reply"]:::ext
classDef app fill:#dcfce7,stroke:#15803d,color:#14532d
classDef db fill:#fed7aa,stroke:#c2410c,color:#7c2d12
classDef queue fill:#ddd6fe,stroke:#6d28d9,color:#4c1d95
classDef ext fill:#e9d5ff,stroke:#7e22ce,color:#581c87
If Slack goes down at 3 AM, new tickets still get created and assigned. Agents just do not get Slack DMs until the queue drains.
Take this with you. Anything reactive lives after Kafka, not before. Tie notifications to the write path and every Slack outage becomes a support outage.
The full architecture
Putting the four decisions together:
flowchart TB
subgraph Edge["Client edge"]
C([Web / Mobile / Slack]):::user
GW["API Gateway<br/>(auth · idempotency · rate limit)"]:::edge
end
subgraph WritePath["Synchronous write path"]
IA["Intake Adapters<br/>(email · form · chat · slack · sms)"]:::app
E["Ticket Service<br/>(stateless pods)"]:::app
AS["Assignment Service<br/>(skill + load + pull queue)"]:::app
SW["SLA Worker<br/>(sweeps every 30s)"]:::app
end
DB[("Postgres<br/>tickets · messages ·<br/>assignments · sla · audit (90d hot)")]:::db
subgraph ReadPath["Dashboard read path"]
R["Read Service"]:::app
Cache[("Redis<br/>agent dashboards")]:::cache
end
K{{"Kafka<br/>ticket.created<br/>ticket.replied<br/>ticket.assigned<br/>sla.breached"}}:::queue
subgraph Consumers["Async consumers"]
N["Notification Service"]:::app
IX["Search Indexer<br/>(Elasticsearch)"]:::app
Cold[("Audit cold tier<br/>S3 + Athena · 7yr")]:::db
end
C --> GW
GW -->|intake| IA
GW -->|read| R
IA --> E
E --> AS
E --> DB
SW --> DB
R --> Cache
R -.miss.-> DB
DB -->|CDC / outbox| K
K --> N
K --> IX
K --> Cold
classDef user fill:#dbeafe,stroke:#1e40af,color:#1e3a8a
classDef edge fill:#e2e8f0,stroke:#475569,color:#1e293b
classDef app fill:#dcfce7,stroke:#15803d,color:#14532d
classDef db fill:#fed7aa,stroke:#c2410c,color:#7c2d12
classDef cache fill:#fecaca,stroke:#b91c1c,color:#7f1d1d
classDef queue fill:#ddd6fe,stroke:#6d28d9,color:#4c1d95
Each component in one line:
| Component | Purpose |
|---|---|
| API Gateway | Authenticates callers, rate-limits bots, deduplicates mobile retries. |
| Intake Adapters | Parse channel-native format, thread replies into existing tickets, emit a common event. |
| Ticket Service | The state machine. Enforces valid transitions. Writes to Postgres transactionally. Stateless. |
| Assignment Service | “Which team? Which agent? Who is on shift right now?” |
| SLA Worker | Scans the SLA table every 30 seconds. Fires warnings and breach events. Pauses outside business hours. |
| Postgres | Source of truth. Live state plus 90 days of audit. |
| Read Service + Redis | Optimized for agent dashboards. Keeps the primary DB from being read constantly. |
| Kafka | Carries events to the async world. The boundary between synchronous writes and reactive consumers. |
| Notification, Search Indexer, Audit cold tier | Consumers. Not on the write path. If the notifier dies, tickets still flow. |
Walk: one ticket, all the way through
Alice emails support@acme.com about a wrong order.
sequenceDiagram
autonumber
participant Alice
participant EA as Email Adapter
participant GW as API Gateway
participant E as Ticket Service
participant AS as Assignment Svc
participant DB as Postgres
participant K as Kafka
participant N as Notification Svc
Alice->>EA: sends email (new subject, no In-Reply-To)
EA->>EA: parse headers, no existing ticket match
EA->>GW: POST /tickets (normalized intake event)
GW->>E: forward (auth ok, idempotency checked)
rect rgb(241, 245, 249)
Note over E,DB: one database transaction
E->>DB: INSERT ticket (state=new)
E->>DB: INSERT ticket_message
E->>DB: INSERT sla_targets (first_response_due = now + 1 biz hr)
E->>DB: INSERT ticket_audit
E->>DB: COMMIT
end
E->>AS: assign ticket
AS->>DB: SELECT FOR UPDATE SKIP LOCKED, pick least-busy agent (Bob)
AS-->>E: assigned to Bob
E->>K: emit ticket.created, ticket.assigned
E-->>GW: 201 Created
GW-->>Alice: ticket id + public URL (~300ms total)
K->>N: ticket.assigned event
N-->>Bob: Slack DM + email (fire and forget)
N-->>Alice: "we got your message" auto-reply
Three things to notice:
- The ticket, the message, the SLA row, and the audit entry are written in one transaction. A crash mid-write rolls back cleanly.
- Kafka is written after the commit. Slack notifications fan out from there. The write path does not wait for Slack.
- Alice gets her confirmation in ~300ms. Bob gets notified a few seconds later via Kafka.
Walk: SLA warning, then breach
48 minutes pass. Bob has not replied. The SLA worker wakes up.
sequenceDiagram
autonumber
participant SW as SLA Worker
participant DB as Postgres
participant K as Kafka
participant N as Notification Svc
participant TL as Team Lead
SW->>DB: query sla_targets WHERE first_response_due < NOW() + 30min
DB-->>SW: TKT-4521 (first_response_due in 12 min)
rect rgb(241, 245, 249)
Note over SW,DB: update breach_state, emit warning
SW->>DB: UPDATE sla_targets SET breach_state = 'at_risk'
SW->>K: emit sla.warning { ticket_id, minutes_remaining: 12 }
end
K->>N: sla.warning
N-->>TL: "TKT-4521 breaches in 12 minutes"
Note over SW: 12 minutes pass, no reply recorded
SW->>DB: query sla_targets (next sweep)
DB-->>SW: TKT-4521 (first_response_due < NOW())
rect rgb(241, 245, 249)
Note over SW,DB: mark breached
SW->>DB: UPDATE sla_targets SET breach_state = 'breached'
SW->>DB: UPDATE tickets SET priority = 'urgent'
SW->>K: emit sla.breached { ticket_id }
end
K->>N: sla.breached
N-->>TL: page + reassign to senior agent pool
The hard sub-problem: SLA windows with business hours across timezones
A customer in Tokyo submits a high-priority ticket at 11 PM Japan Standard Time. The support team is in San Francisco. It is 6 AM Pacific. Business hours have not started yet.
The SLA clock must not run during off-hours for the team. But it also must not confuse the customer’s timezone with the team’s timezone.
flowchart TB
subgraph Input["At ticket creation"]
T["ticket.created_at = 2026-05-28 11:00 PM JST<br/>(= 06:00 AM PST)"]
end
subgraph Schedule["Team schedule lookup"]
S["business_hours: SF team<br/>Mon-Fri 09:00-18:00 PST<br/>US holidays excluded"]
end
subgraph Compute["add_business_time(start, 1 hour, schedule)"]
C1["cursor = 06:00 AM PST (not in window)"]
C2["jump to 09:00 AM PST (window opens)"]
C3["advance 1 business hour"]
C4["first_response_due = 10:00 AM PST = 03:00 AM JST next day"]
end
subgraph Pause["When ticket moves to WaitingOnCustomer at 09:30 AM PST"]
P1["record paused_at = 09:30 AM PST"]
P2["stop SLA clock"]
end
subgraph Resume["Customer replies at 02:00 PM PST"]
R1["paused_for = 4.5 hours"]
R2["shift both deadlines forward by 4.5 hours"]
R3["new first_response_due = 02:30 PM PST"]
end
Input --> Schedule --> Compute --> Pause --> Resume
Two things the naive implementation gets wrong:
| Mistake | Consequence |
|---|---|
| Use wall-clock time instead of business hours | SLA breaches happen over the weekend when nobody is on shift |
| Forget to pause when waiting on customer | Agents game the system by closing and reopening tickets to reset the clock |
Take this with you. SLA correctness depends on three things working together: business-hour-aware deadline math, clock pauses on
WaitingOnCustomer, and a reliable sweep worker. Break any one and the metric becomes meaningless.
Follow-up questions
Try answering each in 2-3 sentences before opening the solution.
Email threading when the subject is stripped. A customer replies from a mobile app that removes the
[TKT-4521]prefix. The email also has noIn-Reply-Toheader. How do you still thread it into the right ticket?Intake adapter crashes mid-batch. The email adapter pulled 50 messages from IMAP and crashed after processing 30. On restart, how do you avoid reprocessing the first 30 and losing the last 20?
Two agents claim the same queued ticket. Both click “Next Ticket” within the same second. The query returns the same ticket to both. How do you make sure only one gets it?
SLA business hours across timezones. Customer in Tokyo, team in San Francisco. Whose hours apply? What if the contract says “follow the sun”?
Reopen after long silence. A customer replies to a ticket closed 6 months ago. What does the system do?
New issue from an existing customer. A customer with an open billing ticket emails again about a completely different problem (login broken). Do you append to the old ticket or open a new one?
Agent vacation handover. Bob has 47 open tickets and starts a 2-week vacation. How do you hand them off?
Spam at the front door. Your support email gets 10,000 spam messages a day. How do you stop them from becoming tickets?
Knowledge base suggestions at intake. When a customer fills out the web form, you want to show 3 relevant KB articles before they hit submit. How do you do this without slowing the form?
“Average time to resolve” is wrong. Management says the dashboard shows 2 hours, but tickets actually take days. What is wrong, and how do you fix it?
Related problems
- Approval Management (011). Same state-machine + role-routing + SLA-timer patterns. A ticket’s lifecycle is structurally identical to an approval’s lifecycle.
- Notification System (010). Every ticket state change (assigned, replied, breached, resolved) fans out to email, Slack, and push. The retry and quiet-hours machinery there is what consumes ticket events.
- Comment System (015). Ticket messages are shaped like comments: threaded, paginated, with attachments. The same storage and indexing patterns apply.
- Read-Heavy System Patterns (017). Agent dashboards and customer portals load tickets thousands of times per day. Cache tiering and read replicas from that problem apply directly here.
Try the problem on your own first. Solutions are most valuable after you've struggled with it.
Solution: Help Desk Ticketing System
The short version
A help desk is a small state machine with a messy front door.
The front door is the hard part. Emails arrive over IMAP with broken subject lines. Chats arrive over WebSockets. Web forms post JSON. Each channel needs an adapter that parses raw input, threads replies into existing tickets, and emits a common intake event.
Once a ticket exists, it follows a defined lifecycle: new -> assigned -> in_progress -> (waiting_on_customer loops) -> resolved -> closed. The Ticket Service enforces valid transitions and writes to Postgres. An Assignment Service picks the team and agent. An SLA Worker sweeps every 30 seconds, pauses outside business hours, and fires escalation events on breach.
Scale is not the hard part. Even at 50,000 tickets per day with 2,000 agents, you write about 15 things per second. The interesting work is correctness: email threading, SLA pause and resume across business hours, assignment races, and reopen logic after customer silence.
1. The two questions that matter most
Which channels? Email is the messiest because reply threading requires parsing RFC 5322 headers and falling back to subject tags. Every channel after email is just a new adapter. Adding Slack later means writing a Slack adapter, not changing the core system.
What does the SLA look like? Wall-clock SLA (“respond in 1 hour”) is three lines of code. Business-hour SLA across timezones is a real engineering project and the source of most reporting bugs in production.
Everything else (assignment strategy, escalation rules, audit retention) follows from those two answers.
2. The math, in plain numbers
| Scale | Tickets/day | Writes/sec (peak) | Open at once | Agent reads/sec |
|---|---|---|---|---|
| Startup (3 agents) | 50 | negligible | ~100 | ~3 |
| Enterprise (2,000 agents) | 50,000 | ~15 | ~150,000 | ~65 |
Three observations:
- 150,000 open tickets at any moment at enterprise scale. “Show me my pending tickets” must return in under 50ms.
- Reads beat writes 10 to 1. Every agent refreshes their dashboard a few times per minute. Caching the read path matters more than write throughput.
- 15 writes per second at peak. A single Postgres handles this. The architecture exists for correctness and read latency, not for QPS.
3. The API
Three endpoints carry the whole product. Create a ticket. Add a message. Change status.
1
2
3
4
5
6
7
8
9
POST /api/v1/tickets
Idempotency-Key: <uuid>
{
"subject": "Wrong order received",
"body": "I ordered size L but received size S...",
"channel": "email",
"customer_email": "alice@example.com"
}
1
2
3
4
5
6
POST /api/v1/tickets/{ticket_id}/messages
{
"body": "Can you share the order number?",
"visibility": "public" | "internal"
}
A public message also pushes out via the outbound adapter (email reply, Slack DM). An internal message is an agent-only note. The customer never sees it.
1
2
3
4
5
6
POST /api/v1/tickets/{ticket_id}/status
{
"status": "waiting_on_customer" | "resolved" | "closed",
"reason": "Asked for order number"
}
An invalid transition (like closed -> in_progress without going through reopened) returns 422 Unprocessable Entity.
Small but load-bearing choices:
| Choice | Reason |
|---|---|
Idempotency-Key required on create | SES retries on timeout. Mobile clients retry. Without the key, you create duplicate tickets. |
visibility: internal | The outbound adapter checks this flag before sending. Keeps agent war-room notes from reaching the customer. |
| Messages are the unit of conversation | Tickets group messages. Messages carry content. This separation lets you page through a long conversation without touching the ticket row. |
Status codes: 409 means idempotency key collision, return the existing ticket. 410 means the ticket was merged and this ID no longer exists.
4. The data model
erDiagram
tickets ||--o{ ticket_messages : has
tickets ||--|| sla_targets : "one per ticket"
tickets ||--o{ assignments : history
tickets ||--o{ ticket_audit : "many events"
tickets {
uuid ticket_id
text public_ref
text subject
text customer_email
text status
text assignee_id
text team_id
timestamptz created_at
}
ticket_messages {
uuid message_id
uuid ticket_id
text author_type
text visibility
text body
text inbound_message_id
}
sla_targets {
uuid ticket_id
timestamptz first_response_due
timestamptz resolution_due
timestamptz paused_at
text breach_state
}
ticket_audit {
uuid audit_id
uuid ticket_id
text event_type
jsonb payload
timestamptz occurred_at
}
Show: the full SQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
CREATE TABLE tickets (
ticket_id UUID PRIMARY KEY,
public_ref TEXT NOT NULL UNIQUE,
subject TEXT NOT NULL,
customer_id TEXT,
customer_email TEXT,
channel TEXT NOT NULL,
channel_thread_id TEXT,
status TEXT NOT NULL DEFAULT 'new',
priority TEXT NOT NULL DEFAULT 'medium',
team_id TEXT,
assignee_id TEXT,
tags TEXT[] DEFAULT '{}',
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
first_response_at TIMESTAMPTZ,
resolved_at TIMESTAMPTZ,
closed_at TIMESTAMPTZ,
reopen_count INT NOT NULL DEFAULT 0,
parent_ticket_id UUID,
metadata JSONB
);
CREATE INDEX idx_tk_assignee_status ON tickets (assignee_id, status)
WHERE status NOT IN ('closed','spam','duplicate');
CREATE INDEX idx_tk_team_status ON tickets (team_id, status, priority);
CREATE INDEX idx_tk_customer ON tickets (customer_id, created_at DESC);
CREATE INDEX idx_tk_channel_thread ON tickets (channel, channel_thread_id);
CREATE TABLE ticket_messages (
message_id UUID PRIMARY KEY,
ticket_id UUID NOT NULL REFERENCES tickets(ticket_id),
author_type TEXT NOT NULL,
author_id TEXT,
visibility TEXT NOT NULL DEFAULT 'public',
body TEXT NOT NULL,
body_format TEXT NOT NULL DEFAULT 'plain',
inbound_channel TEXT,
inbound_message_id TEXT,
in_reply_to TEXT,
attachments JSONB DEFAULT '[]',
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_msg_ticket ON ticket_messages (ticket_id, created_at);
CREATE UNIQUE INDEX idx_msg_inbound ON ticket_messages (inbound_message_id)
WHERE inbound_message_id IS NOT NULL;
CREATE TABLE assignments (
assignment_id UUID PRIMARY KEY,
ticket_id UUID NOT NULL REFERENCES tickets(ticket_id),
team_id TEXT NOT NULL,
agent_id TEXT,
assigned_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
unassigned_at TIMESTAMPTZ,
assigned_by TEXT,
reason TEXT
);
CREATE INDEX idx_assignments_agent ON assignments (agent_id) WHERE unassigned_at IS NULL;
CREATE TABLE sla_targets (
ticket_id UUID PRIMARY KEY REFERENCES tickets(ticket_id),
priority TEXT NOT NULL,
first_response_due TIMESTAMPTZ,
resolution_due TIMESTAMPTZ,
paused_at TIMESTAMPTZ,
pause_reason TEXT,
business_hours_id TEXT NOT NULL,
first_response_at TIMESTAMPTZ,
resolved_at TIMESTAMPTZ,
breach_state TEXT NOT NULL DEFAULT 'on_track'
);
CREATE INDEX idx_sla_first ON sla_targets (first_response_due)
WHERE first_response_at IS NULL AND paused_at IS NULL;
CREATE INDEX idx_sla_res ON sla_targets (resolution_due)
WHERE resolved_at IS NULL AND paused_at IS NULL;
CREATE TABLE ticket_audit (
audit_id UUID PRIMARY KEY,
ticket_id UUID NOT NULL,
occurred_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
event_type TEXT NOT NULL,
actor_type TEXT NOT NULL,
actor_id TEXT,
payload JSONB NOT NULL
);
CREATE INDEX idx_audit_ticket ON ticket_audit (ticket_id, occurred_at);
Four small things doing real work:
Composite index (assignee_id, status). Serves the single most common query: “show me my open tickets.” Without it, every agent dashboard scans the full tickets table.
UNIQUE on ticket_messages.inbound_message_id. When the email adapter crashes and reprocesses the same message, the duplicate insert fails silently. No duplicate messages on the ticket.
assignments is history, not current state. The current assignee lives on the ticket row for fast reads. The full reassignment history lives in assignments for audit and analytics.
sla_targets is a separate narrow table. The SLA worker scans it every 30 seconds. Keeping SLA columns out of the wide tickets row makes the index small and the sweep fast.
Why Postgres and not Cassandra? State-machine transitions need ACID. When an agent replies, three things must happen together: insert the message, set first_response_at, and update sla_targets. Postgres gives you that in one transaction.
5. The intake engine
Email is the hardest channel. For each inbound message, the adapter checks three layers in order.
flowchart TD
A[raw email arrives] --> B[parse RFC 5322 headers]
B --> C{In-Reply-To matches<br/>a saved Message-ID?}
C -- yes --> D[append to existing ticket]:::ok
C -- no --> E{subject contains TKT-NNNN?}
E -- yes --> D
E -- no --> F[create new ticket]:::ok
classDef ok fill:#dcfce7,stroke:#15803d,color:#14532d
Show: the threading algorithm
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
def handle_inbound_email(raw_email):
msg = parse_email(raw_email)
if is_spam(msg):
record_dropped(msg, reason='spam')
return
# Layer 1: RFC 5322 headers
parent_id = msg.headers.get('In-Reply-To')
refs = msg.headers.get('References', '').split()
candidates = [parent_id] + refs if parent_id else refs
existing_ticket = None
for mid in candidates:
row = db.query(
"SELECT ticket_id FROM ticket_messages WHERE inbound_message_id = ?", mid)
if row:
existing_ticket = row.ticket_id
break
# Layer 2: subject tag
if not existing_ticket:
ref = extract_ticket_ref(msg.subject) # looks for [TKT-NNNN]
if ref:
existing_ticket = db.query(
"SELECT ticket_id FROM tickets WHERE public_ref = ?", ref)
# Layer 3: heuristic (off by default)
if existing_ticket:
emit("ticket.reply_received", existing_ticket, msg)
else:
emit("ticket.intake", channel='email', payload=normalize(msg))
The three layers in order of trust: header match (~85%), subject tag (~10%), heuristic (off by default). The remaining ~5% open new tickets. Agents merge duplicates via a “merge into” button.
Take this with you. If 5% of all replies open new tickets instead of threading, agents spend their day merging instead of answering.
6. The state machine
The Ticket Service enforces valid transitions. Any attempt to jump to an invalid state returns 422. The machine:
stateDiagram-v2
direction LR
[*] --> New: intake event
New --> Assigned: assignment picked
Assigned --> InProgress: agent starts work
InProgress --> WaitingOnCustomer: needs info (SLA pauses)
WaitingOnCustomer --> InProgress: customer replies (SLA resumes)
InProgress --> Resolved: agent marks fixed
Resolved --> Closed: confirmed or 72h auto-close
Closed --> Reopened: reply within 7 days
Reopened --> InProgress
New --> Spam
New --> Duplicate
InProgress --> Assigned: reassigned
Closed --> [*]
Show: the transition table in code
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
VALID_TRANSITIONS = {
'new': {'assigned', 'spam', 'duplicate'},
'assigned': {'in_progress', 'new'},
'in_progress': {'waiting_on_customer', 'resolved', 'assigned'},
'waiting_on_customer': {'in_progress', 'resolved'},
'resolved': {'closed', 'in_progress'},
'closed': {'reopened'},
'reopened': {'in_progress'},
}
def transition(ticket, new_status):
if new_status not in VALID_TRANSITIONS.get(ticket.status, set()):
raise InvalidTransition(ticket.status, new_status)
with db.transaction():
db.update("tickets", ticket.id, status=new_status)
db.insert("ticket_audit", event_type="status_changed",
payload={"from": ticket.status, "to": new_status})
if new_status == 'waiting_on_customer':
pause_sla(ticket.id, reason='waiting_on_customer')
elif ticket.status == 'waiting_on_customer' and new_status == 'in_progress':
resume_sla(ticket.id)
7. The architecture
flowchart TB
subgraph Edge["Client edge"]
C([Web / Mobile / Slack]):::user
GW["API Gateway<br/>(auth · idempotency · rate limit)"]:::edge
end
subgraph WritePath["Synchronous write path"]
IA["Intake Adapters<br/>(email · form · chat · slack · sms)"]:::app
E["Ticket Service<br/>(stateless pods)"]:::app
AS["Assignment Service<br/>(skill + load + pull queue)"]:::app
SW["SLA Worker<br/>(sweeps every 30s)"]:::app
end
DB[("Postgres<br/>tickets · messages ·<br/>assignments · sla · audit (90d hot)")]:::db
subgraph ReadPath["Dashboard read path"]
R["Read Service"]:::app
Cache[("Redis")]:::cache
end
K{{"Kafka<br/>ticket.created<br/>ticket.replied<br/>ticket.assigned<br/>sla.breached"}}:::queue
subgraph Consumers["Async consumers"]
N["Notification Service"]:::app
IX["Search Indexer<br/>(Elasticsearch)"]:::app
Cold[("Audit cold tier<br/>S3 + Athena · 7yr")]:::db
end
C --> GW
GW -->|intake| IA
GW -->|read| R
IA --> E
E --> AS
E --> DB
SW --> DB
R --> Cache
R -.miss.-> DB
DB -->|CDC / outbox| K
K --> N
K --> IX
K --> Cold
classDef user fill:#dbeafe,stroke:#1e40af,color:#1e3a8a
classDef edge fill:#e2e8f0,stroke:#475569,color:#1e293b
classDef app fill:#dcfce7,stroke:#15803d,color:#14532d
classDef db fill:#fed7aa,stroke:#c2410c,color:#7c2d12
classDef cache fill:#fecaca,stroke:#b91c1c,color:#7f1d1d
classDef queue fill:#ddd6fe,stroke:#6d28d9,color:#4c1d95
Five things to notice:
- The Ticket Service is the only writer to Postgres. Everything else is downstream of Kafka. Notifier down? Tickets still get created and assigned. Emails just queue up.
- Intake adapters track their own progress (
last_processed_uidper IMAP mailbox). Crash and resume without losing or duplicating messages. - The Ticket Service is stateless. State lives in Postgres. Roll pods during the day with zero impact.
- The SLA Worker is a separate process. A slow sweep cannot stall the API. At large scale, shard by
ticket_idhash across N pods. - Audit lives in two tiers. Last 90 days in Postgres for fast queries. Older rows in S3 Parquet for compliance queries via Athena.
8. A ticket, end to end
sequenceDiagram
autonumber
participant Alice
participant EA as Email Adapter
participant E as Ticket Service
participant AS as Assignment Svc
participant DB as Postgres
participant K as Kafka
participant N as Notification Svc
Alice->>EA: sends email (new issue)
EA->>EA: parse headers, no ticket match found
EA->>E: POST /tickets (normalized intake event)
rect rgb(241, 245, 249)
Note over E,DB: one database transaction
E->>DB: INSERT ticket (state=new)
E->>DB: INSERT ticket_message
E->>DB: INSERT sla_targets
E->>DB: INSERT ticket_audit
E->>DB: COMMIT
end
E->>AS: assign (team=orders, pick agent)
AS->>DB: SELECT FOR UPDATE SKIP LOCKED, pick Bob (least busy)
AS-->>E: assigned to Bob
E->>K: emit ticket.created, ticket.assigned
E-->>Alice: 201 Created, ticket id + URL (~300ms)
K->>N: ticket.assigned event
N-->>Bob: Slack DM + email
N-->>Alice: "we got your message" auto-reply
Target latencies:
| Operation | P99 |
|---|---|
| Web form intake | ~300ms |
| Agent reply | ~200ms |
| Dashboard read (cache hit) | ~5ms |
| Dashboard read (cache miss) | ~50ms |
| Full-text search | ~150ms |
Recording an agent reply follows the same shape. The Ticket Service validates the transition, opens a transaction, inserts the message, updates first_response_at, updates sla_targets, writes an audit event, and commits. The Kafka event triggers the outbound adapter to send the reply email. Agent dashboard reads hit the Read Service, check Redis agent:{id}:open_tickets, fall through to Postgres read replica on miss.
9. The scaling journey: 10 users to 1 million
flowchart LR
S1["Stage 1<br/>3 agents, 50 tickets/day<br/>1 Postgres + 1 pod<br/>~$100/mo"]:::s1
S2["Stage 2<br/>20 agents, 500 tickets/day<br/>+ adapters + skill routing<br/>+ SLA worker<br/>~$300/mo"]:::s2
S3["Stage 3<br/>200 agents, 5k tickets/day<br/>+ Redis + Kafka<br/>+ Elasticsearch<br/>~$3-5k/mo"]:::s3
S4["Stage 4<br/>2,000 agents, 50k tickets/day<br/>+ multi-region<br/>+ DB sharding<br/>~$30-80k/mo"]:::s4
S1 --> S2 --> S3 --> S4
classDef s1 fill:#e0f2fe,stroke:#0369a1,color:#0c4a6e
classDef s2 fill:#dcfce7,stroke:#15803d,color:#14532d
classDef s3 fill:#fef3c7,stroke:#a16207,color:#713f12
classDef s4 fill:#fce7f3,stroke:#be185d,color:#831843
Stage 1: 3 agents, 50 tickets per day
One Postgres on a t3.medium. One Ticket Service process. One IMAP poller. Assignment is round-robin. SLA is a cron job every minute. Notifications are inline HTTP calls to SendGrid. About $100/month. Postgres is bored. Build nothing before it hurts.
Stage 2: 20 agents, 500 tickets per day
Something breaks: marketing adds a web form (need a second adapter). Billing and technical agents diverge (need skill-based routing). A customer on a $50k contract asked why their SLA was breached on a Sunday.
Add web form and chat adapters. Add skill-based routing via ticket tags matched to agent skill arrays. Add a dedicated SLA worker that understands business hours. Add ticket_audit. Still no Kafka, no Redis, no Elasticsearch. About $300/month.
Stage 3: 200 agents, 5,000 tickets per day
Several things break at once. Agent dashboards take 3 seconds. Full-text search takes 4 seconds. The SLA worker takes 45 seconds per sweep over 30k open tickets.
Add two Postgres read replicas. Add Redis for agent dashboard cache. Bring in Kafka properly. Pipe Postgres changes to Elasticsearch via Debezium. Shard the SLA worker by ticket_id hash. Add WebSockets so dashboards get live updates instead of polling. Cost: $3-5k/month.
Stage 4: 2,000 agents, 50,000 tickets per day, 4 regions
New problems: a long analytics query stalls the Ticket Service. EU mailbox intake from US-East adds 150ms latency. Enterprise customers require data residency.
Shard the tickets DB by team_id. Deploy per-region stacks (US, EU, APAC, AU). Each region has its own intake adapters, Ticket Service, DB shards, Redis, and Kafka. The core architecture has not changed since Stage 3. The data model is identical to Stage 1. Cost: $30-80k/month.
10. Reliability
Intake adapters use at-least-once delivery. Track last_processed_uid per IMAP mailbox. Commit only after the message is safely staged in Postgres. On crash and restart, resume from the last UID. Reprocessed messages are deduped by the unique index on ticket_messages.inbound_message_id.
The SLA Worker is stateless and idempotent. State lives in sla_targets. On restart, the next sweep picks up where the last left off. Worst case: 30 seconds of breach-detection lag. Run two workers active-active behind per-shard advisory locks.
Assignment races. Two assign_ticket calls for the same team: serialized by a team-level SELECT FOR UPDATE. Two pull-queue calls for the same ticket: serialized by SKIP LOCKED. No application-level coordination needed.
Notification failures. Kafka consumers run at-least-once with exponential backoff. Dead-letter topic after max retries, watched by an on-call dashboard.
Postgres failover. Primary dies, replica promotes. In-flight transactions roll back. Clients retry with idempotency keys. Workers reconnect within seconds.
11. Observability
| Metric | Why it matters |
|---|---|
tickets.created.rate by channel | Sudden drop means an adapter is broken. Spike means an outage or marketing blast. |
tickets.in_flight.count by team | Slow rise means a team is falling behind. |
time_to_first_response p50/p95 | The headline customer-facing SLO. |
time_to_resolution p50/p95 by priority | The other headline SLO. |
sla.breach.rate by priority and team | Per-team operational health. |
agent.utilization | Identifies overloaded agents and underused agents. |
assignment.latency p99 | High means the Assignment Service is contended on team-level locks. |
intake.threading.success_rate | Fraction of emails that thread into an existing ticket. Drop means headers are being stripped. |
kb.suggestion.deflection.rate | Customers who saw KB suggestions and closed the form without submitting. The self-service ROI metric. |
audit.write.lag | Audit must not lag. Alert at more than 5 seconds. |
Page on: any adapter offline for more than 2 minutes. SLA worker not running. Postgres primary unreachable.
Ticket on: time_to_first_response p95 regression more than 30%. sla.breach.rate doubles. Reopen rate above 10%.
12. Follow-up answers
1. Email threading when subject is stripped.
In-Reply-To and References headers are the primary signal. About 85% of replies hit this path. A single index lookup on ticket_messages.inbound_message_id finds the match. Subject tag [TKT-1234] is the fallback for corporate mail proxies that strip headers. If all three fail, a new ticket opens and the agent merges via “merge into.”
2. Intake adapter crashes mid-batch.
The adapter never advances last_processed_uid until the message is safely staged. On restart, it resumes from the last saved UID. Reprocessed messages fail silently on the unique constraint in ticket_messages. No message loss. No duplicate tickets.
3. Two agents claim the same queued ticket.
The pull query uses SELECT ... FOR UPDATE SKIP LOCKED LIMIT 1 inside a transaction. Agent A locks TKT-100 and writes assignee_id. Agent B’s identical query at the same instant skips TKT-100 (already locked) and gets TKT-101. Both commit. No conflict. This is the canonical Postgres use case for SKIP LOCKED.
4. SLA business hours across timezones.
Default: the team’s business hours schedule (for example, “US-West 9-6 weekdays, US holidays excluded”). Enterprise contracts that specify 24/7 coverage get business_hours_id = always. The contract policy lives on the customer record and overrides the team default at ticket creation time. For “follow the sun,” set the customer’s policy to always and let the assignment service route to whichever region is on shift.
5. Reopen after long silence.
Within 7 days of closed_at, a customer reply reactivates the original ticket: status goes to reopened, immediately to in_progress, and reopen_count increments. The SLA clock resets, optionally to a tighter reopen SLA (4 hours instead of 24).
Outside the 7-day window, a new ticket opens with parent_ticket_id pointing to the closed one. The agent UI surfaces the parent: “This customer had a related issue previously.”
6. New issue from a customer with an open ticket.
Open a new ticket by default. The unrelated-issue case is more common than the follow-up case. Threading logic decides: In-Reply-To pointing to a message in the existing ticket means append. Subject tag match means append. Anything else means a new ticket.
7. Agent vacation handover.
The agent sets out_of_office: {start, end, delegate_id} in their profile. On the start date, a job reassigns all open tickets to the delegate (or to the team queue if no delegate), records reason: vacation_handover in assignments, and notifies the delegate. For new tickets where the OOO agent would have been the round-robin pick, skip them in the rotation until the end date.
8. Spam at the front door.
Layered defense: SPF/DKIM/DMARC at SMTP rejects spoofed mail. SpamAssassin or AWS SES spam scoring sends high-score mail to a quarantine mailbox, not the ticket database. Per-sender rate limits drop bulk senders. A known-customer allowlist bypasses aggressive filtering. The 1% that gets through is closed as spam with a single click.
9. KB suggestions at intake.
As the customer types, debounce 500ms then call POST /api/v1/kb/suggest with the current subject and body. The endpoint queries Elasticsearch (top 3 articles by relevance) and returns within 50ms. Display inline above the submit button. Track whether the customer submits anyway (kb.suggestion.deflection). Mature systems deflect 20-40% of would-be tickets.
10. “Average time to resolve” is wrong.
Three common bugs. First: it includes waiting_on_customer time, but the agent was not working then. Subtract time in that state. Second: it uses wall-clock time and ignores business hours. A ticket created Friday 5 PM and resolved Monday 10 AM is not 65 hours; it is about 3 business hours. Third: it counts spam and duplicate tickets, which close instantly and drag the average down. Exclude them. Show p50 and p95 alongside the average. The average alone hides long-tail outliers.
13. Trade-offs worth saying out loud
Push vs pull assignment. Push gives every ticket an owner instantly, which is better for SLA accountability. Pull avoids assigning to agents who just logged off and preserves agent autonomy. Most production setups push by default with pull as fallback when the pushed agent is unavailable.
Postgres full-text search vs Elasticsearch. Postgres FTS with a tsvector GIN index handles up to about 1 million tickets cheaply. Elasticsearch is a separate cluster to operate but gives faceted search, fuzzy matching, and sub-100ms relevance ranking. Switch when FTS latency exceeds 500ms or when the product asks for facets.
Per-ticket scheduled timers vs sweep worker. Per-ticket timers sound clean but cancellation on resolve or pause becomes brittle at 150k open tickets. Sweep is boring, reliable, and easy to debug. The cost is up to 30 seconds of breach-detection lag. Nobody notices.
Custom build vs Zendesk. If support volume is small, buy. Build when you have unusual integrations, a workflow that does not fit the SaaS model (regulated industries, embedded support, deep cross-system automation), or volume large enough that the per-agent bill exceeds the engineering cost.
14. Common mistakes
Tickets as CRUD with a status column. If any agent can flip status to any value, you have a Trello clone. The state machine is the system.
No intake adapter layer. Jumping straight to “the API takes JSON” misses the entire email-parsing and threading problem. That is the single hardest part of the system.
Round-robin assignment at any scale. Works for teams under 10 people. Breaks at 20+ when skill matters and at 100+ when load matters.
Wall-clock SLA with no business hours. Most candidates skip this. Real enterprise customers will not tolerate a 24-hour SLA that expires Sunday at 3 AM.
No SLA pause for waiting_on_customer. Without pause, every ticket that needs customer input eventually breaches. Agents learn to game it. The metric loses all meaning.
No reopen handling. Customer replies to a closed ticket, a new ticket opens, conversation history is lost. Build the 7-day reopen window from day one.
Email threading by subject only. Subjects get rewritten, prefixed with “FW:” or “RE:”, and translated. Use In-Reply-To as the primary signal.
No assignment race control. Two agents claim the same ticket without SKIP LOCKED and you get split-brain assignments. One agent works the ticket. The other also works it. The customer gets two conflicting replies.
Notifications glued to the Ticket Service. Tickets emit events. The notification service consumes them. Coupling them means you cannot swap Slack for Teams without touching the core service.
If you can name seven of these nine unprompted, you are interviewing at senior or staff level. The three that separate strong from average: state machine over CRUD, email threading by headers not subject, and business-hour SLA pausing.