System Design Roadmap

Six months. Seven stages. From basics to designing Instagram on a whiteboard.

A real, ordered learning path. Each stage builds on the last one. You finish each stage by building something small, not by memorising slides.

Looking for a single concept? The System Design Concepts library has short, scenario-driven answers to 70 common questions. Use it as a quick lookup when a stage below mentions something unfamiliar.

How to read this page

Read top to bottom. Do the stages in order. The order matters more than people think. If you try to learn Kafka before you understand what TCP is, you will get stuck and quit. Most people who fail at system design fail because they skipped the boring basics.

The plan below assumes six months. Two other paces work too:

  • If you already write code at work, plan on four months.
  • If you are still learning to code, plan on eight months.

Either way, the structure is the same.


The journey, in one picture

flowchart LR
    S1[1. Foundations<br/>Month 1]:::a
    S2[2. Storage<br/>Month 2]:::a
    S3[3. Caching and queues<br/>Month 3]:::b
    S4[4. Scaling and reliability<br/>Month 4]:::b
    S5[5. Distributed systems<br/>Month 5]:::c
    S6[6. Real architectures<br/>Month 6]:::c
    S7[7. Interview craft<br/>parallel from week 1]:::d

    S1 --> S2 --> S3 --> S4 --> S5 --> S6
    S7 -.- S6

    classDef a fill:#dbeafe,stroke:#1e40af,color:#1e3a8a
    classDef b fill:#dcfce7,stroke:#15803d,color:#14532d
    classDef c fill:#fed7aa,stroke:#c2410c,color:#7c2d12
    classDef d fill:#fef3c7,stroke:#a16207,color:#713f12

Stages 1 to 6 are sequential. Stage 7 runs alongside the whole thing.


What you can do at each level

A quick honesty check. Where are you now, where do you want to be?

| Level | What you can do | What people pay you for |
| --- | --- | --- |
| Junior | Build a CRUD app. Add a database. Deploy it. | Writing features against an existing design. |
| Mid | Add a cache. Add a queue. Read a slow query plan. Recover a backup. | Owning a service end to end. |
| Senior | Design a system from scratch. Know when to pick SQL vs NoSQL. Spot the bottleneck before it ships. | Designing systems other engineers will build. |
| Staff | Design across multiple teams. Predict failure modes you have never seen. Have an opinion on every trade-off. | Setting direction. Catching the failure mode nobody else sees. |

By the end of Stage 4 you are mid-level. By the end of Stage 6 you are senior. Staff comes from production scars, not from a roadmap.


Stage 1: Foundations

Goal. Learn the vocabulary. You cannot design anything if you do not know what a server, a port, or a DNS record actually is.

The picture in your head.

flowchart LR
    U(["You"]):::u -->|"https://"| D[("DNS")]:::ext
    D -->|"IP address"| S(["Server"]):::s
    U -->|"HTTP request"| S
    S -->|"HTTP response"| U

    classDef u fill:#dbeafe,stroke:#1e40af,color:#1e3a8a
    classDef s fill:#dcfce7,stroke:#15803d,color:#14532d
    classDef ext fill:#e9d5ff,stroke:#7e22ce,color:#581c87

Topics.

| Group | Topics |
| --- | --- |
| Networking basics | What an IP address is. Ports. TCP vs UDP. Why TCP needs a handshake. |
| The web stack | DNS. HTTP methods. HTTP status codes. Headers. Why HTTPS is different. |
| Mental models | Latency vs throughput. The four nines. Synchronous vs asynchronous. |
| Useful numbers to memorise | Memory access: 100 ns. SSD read: 100 µs. Disk seek: 10 ms. Cross-region network: 100 ms. |
| API styles | REST. RPC. gRPC. GraphQL. WebSocket. When each one fits. |

Build this in week 4. A tiny HTTP server in any language. Two endpoints: POST /links and GET /:id. Store data in a JSON file. Deploy it on a free tier (Fly, Render, Railway).
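
One way to start the Stage 1 build, as a minimal sketch using only the Python standard library. The file name `links.json`, the `{"url": ...}` request body, the random id, and the redirect on GET are all assumptions made for illustration; the exercise itself only asks for the two endpoints and a JSON file.

```python
import json
import secrets
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path

DATA_FILE = Path("links.json")  # assumed file name


def load_links(path=DATA_FILE):
    """Return the stored {id: url} mapping, or an empty dict if there is no file yet."""
    if not path.exists():
        return {}
    return json.loads(path.read_text())


def save_link(url, path=DATA_FILE):
    """Store a URL under a new random id and return that id."""
    links = load_links(path)
    link_id = secrets.token_urlsafe(4)  # short random id; collisions are not handled here
    links[link_id] = url
    path.write_text(json.dumps(links))
    return link_id


class LinkHandler(BaseHTTPRequestHandler):
    def _send(self, status, body, headers=None):
        payload = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        for name, value in (headers or {}).items():
            self.send_header(name, value)
        self.end_headers()
        self.wfile.write(payload)

    def do_POST(self):  # POST /links with a JSON body like {"url": "https://..."}
        if self.path != "/links":
            return self._send(404, {"error": "not found"})
        length = int(self.headers.get("Content-Length", 0))
        try:
            url = json.loads(self.rfile.read(length))["url"]
        except (ValueError, KeyError, TypeError):
            return self._send(400, {"error": "expected a JSON body with a 'url' field"})
        self._send(201, {"id": save_link(url)})

    def do_GET(self):  # GET /:id
        url = load_links().get(self.path.lstrip("/"))
        if url is None:
            return self._send(404, {"error": "not found"})
        self._send(302, {"url": url}, {"Location": url})


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), LinkHandler).serve_forever()
```

Rewriting the whole file on every POST is fine at this size. It is also the first thing that breaks under concurrent writes, which is a useful thing to notice before Stage 2.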

You are done when you can look at a cloud architecture diagram and explain every box and every arrow out loud.


Stage 2: Storage and data

Goal. Decide where data lives. The biggest single decision in any design.

The picture in your head.

flowchart LR
    S(["Server"]):::s -->|"reads + writes"| DB[("Primary DB")]:::db
    DB -->|"replication"| R1[("Replica 1<br/>read-only")]:::db
    DB -->|"replication"| R2[("Replica 2<br/>read-only")]:::db
    S -.->|"heavy reads"| R1

    classDef s fill:#dcfce7,stroke:#15803d,color:#14532d
    classDef db fill:#fed7aa,stroke:#c2410c,color:#7c2d12

Topics.

| Group | Topics |
| --- | --- |
| Relational basics | Tables, rows, columns. Primary keys. Foreign keys. JOINs. |
| Indexes | B-tree indexes. Composite indexes. Reading an EXPLAIN plan. |
| Transactions | ACID. Isolation levels. What a deadlock is. Long-running transactions. |
| NoSQL families | Key-value (Redis, DynamoDB). Document (MongoDB). Wide-column (Cassandra). Search (Elasticsearch). |
| Replication | Leader-follower. Replication lag. Read-your-writes. |
| Sharding | Range vs hash sharding. Hot shards. Re-sharding pain. |
| Consistency models | Strong. Eventual. Causal. Read-your-writes. |
| Storage engines | B-trees vs LSM trees. Why your DB choice changes write speed by 10x. |

Build this in week 8. Take your Stage 1 service. Move the JSON file to Postgres. Add one index. Run EXPLAIN on a query and read the plan. Add one slow query that scans the whole table. Watch the latency.
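
If you want to see the "scan vs index" difference before Postgres is set up, here is a small sketch that uses SQLite from the Python standard library as a stand-in. This is a swapped-in tool, not the Postgres exercise itself: SQLite's `EXPLAIN QUERY PLAN` output looks different from Postgres `EXPLAIN`, and the `links` table, its columns, and the index name are assumptions for illustration.

```python
import sqlite3

# In-memory database with a hypothetical links table (stand-in for the Stage 1 data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE links (id INTEGER PRIMARY KEY, slug TEXT, url TEXT)")
conn.executemany(
    "INSERT INTO links (slug, url) VALUES (?, ?)",
    [(f"s{i}", f"https://example.com/{i}") for i in range(10_000)],
)

QUERY = "SELECT url FROM links WHERE slug = ?"


def plan(connection, sql, params):
    """Return the human-readable detail column of SQLite's query plan."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]


before = plan(conn, QUERY, ("s42",))   # no index on slug yet: expect a full scan
conn.execute("CREATE INDEX idx_links_slug ON links (slug)")
after = plan(conn, QUERY, ("s42",))    # now the planner can search the index

print("without index:", before)
print("with index:   ", after)
```

The exact wording of the plan varies by SQLite version. The thing to look for is the change from a scan of the table to a search that names the index.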

You are done when someone asks “should this be SQL or NoSQL?” and your answer is a list of follow-up questions, not a guess.


Stage 3: Caching, queues, and async work

Goal. Make things fast. Decouple the slow parts.

The picture in your head.

flowchart LR
    U(["User"]):::u --> S(["Service"]):::s
    S -->|"check first"| C[("Cache<br/>~90% hit")]:::cache
    S -.->|"cache miss"| DB[("Database")]:::db
    S -->|"event"| Q{{"Queue"}}:::q
    Q --> W(["Async Worker"]):::s

    classDef u fill:#dbeafe,stroke:#1e40af,color:#1e3a8a
    classDef s fill:#dcfce7,stroke:#15803d,color:#14532d
    classDef db fill:#fed7aa,stroke:#c2410c,color:#7c2d12
    classDef cache fill:#fecaca,stroke:#b91c1c,color:#7f1d1d
    classDef q fill:#ddd6fe,stroke:#6d28d9,color:#4c1d95

Topics.

| Group | Topics |
| --- | --- |
| Where caches live | Browser. CDN. In-process. Distributed (Redis, Memcached). |
| Cache strategies | Read-through, write-through, write-behind, cache-aside. |
| Eviction | LRU, LFU, TTL. Why hit rate depends on this more than size. |
| The hard part | Cache invalidation. Hot keys. Thundering herd. Request coalescing. |
| Queues vs streams | SQS-style queue (each message goes to one consumer). Kafka-style stream (many consumers read the same log). |
| Delivery guarantees | At-most-once, at-least-once, exactly-once. Why exactly-once is a lie. |
| Idempotency | Idempotency keys. Dedup. Why every retry-safe endpoint needs them. |
| Patterns | Outbox pattern. CDC (change data capture). Dead letter queue. Backpressure. |
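
Idempotency keys are easier to reason about once you have seen them in a few lines. The sketch below is a toy, in-memory version: the class name `IdempotentHandler`, the key strings, and the click-recording operation are all hypothetical, and a real service would keep the results in a durable store with an expiry and guard against two retries racing each other.

```python
class IdempotentHandler:
    """Runs an operation once per idempotency key and replays the stored result on retries."""

    def __init__(self, operation):
        self._operation = operation
        self._results = {}  # idempotency key -> result of the first successful run

    def handle(self, idempotency_key, payload):
        if idempotency_key in self._results:
            # A retry of a request we already processed: no second side effect.
            return self._results[idempotency_key]
        result = self._operation(payload)
        self._results[idempotency_key] = result
        return result


clicks = []  # stands in for the side effect we must not repeat


def record_click(payload):
    clicks.append(payload["link_id"])
    return {"total_clicks": len(clicks)}


handler = IdempotentHandler(record_click)
first = handler.handle("req-123", {"link_id": "abc"})
retry = handler.handle("req-123", {"link_id": "abc"})  # same key: replayed, not re-run
other = handler.handle("req-456", {"link_id": "abc"})  # new key: runs again
```

This is also why at-least-once delivery is workable: the consumer can see the same message twice and still apply it once.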

Build this in week 12. Add Redis in front of Postgres. Measure the hit rate. Add Kafka or NATS. Move click-counting out of the request path into a background consumer.
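
Before wiring up Redis, it can help to see cache-aside and hit-rate counting in isolation. This is a sketch under stated assumptions: a plain dict stands in for Redis, a function stands in for the Postgres query, and the class name `CacheAside` and the 60-second TTL are made up for the example.

```python
import time


class CacheAside:
    """Cache-aside: check the cache first, fall back to the database on a miss, then fill the cache."""

    def __init__(self, load_from_db, ttl_seconds=60.0, clock=time.monotonic):
        self._load = load_from_db
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (value, expires_at); a dict standing in for Redis
        self.hits = 0
        self.misses = 0

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[1] > self._clock():
            self.hits += 1
            return entry[0]
        self.misses += 1
        value = self._load(key)  # the slow path: go to the database
        self._entries[key] = (value, self._clock() + self._ttl)
        return value

    def invalidate(self, key):
        """Call this after a write so the next read goes back to the database."""
        self._entries.pop(key, None)

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


fake_db = {"abc": "https://example.com"}  # hypothetical data
cache = CacheAside(load_from_db=fake_db.__getitem__)
for _ in range(10):
    cache.get("abc")
print(f"hit rate: {cache.hit_rate:.0%}")
```

Ten reads of one key give one miss and nine hits. Real traffic is less kind, which is why you measure the hit rate instead of assuming it.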

You are done when you can draw a system where a write happens, an event is published, and three downstream services react without the original write caring.


Stage 4: Scaling and reliability

Goal. Survive growth and survive failure. These are the same conversation.

The picture in your head.

flowchart TB
    U(["Users"]):::u --> LB[/"Load Balancer"/]:::edge
    LB --> S1(["Service A"]):::s
    LB --> S2(["Service B"]):::s
    LB --> S3(["Service C"]):::s
    S1 --> DB[("DB primary")]:::db
    S2 --> DB
    S3 --> DB
    DB -.->|"if primary dies"| DB2[("DB standby")]:::db

    classDef u fill:#dbeafe,stroke:#1e40af,color:#1e3a8a
    classDef edge fill:#e2e8f0,stroke:#475569,color:#1e293b
    classDef s fill:#dcfce7,stroke:#15803d,color:#14532d
    classDef db fill:#fed7aa,stroke:#c2410c,color:#7c2d12

Topics.

| Group | Topics |
| --- | --- |
| Scaling shapes | Vertical vs horizontal. Stateless vs stateful. Why the database is the hard part. |
| Load balancers | L4 vs L7. Round-robin, least-connections, IP-hash, consistent-hash. Health checks. |
| Rate limiting | Token bucket. Sliding window. Per-user, per-IP, per-endpoint. |
| Failure handling | Timeouts. Retries with backoff. Jitter. Circuit breakers. Bulkheads. |
| Graceful degradation | What you serve when the recommendations engine is down. What you never compromise on. |
| Capacity planning | Auto-scaling. Connection pools. Headroom. |
| Disaster recovery | RTO, RPO. Backups vs replicas. Region failover. Blast radius. |
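
"Retries with backoff" and "jitter" from the table fit in one small function. The sketch below assumes exponential backoff with full jitter (a random delay between zero and the current cap), which is one common variant among several. The function name, the default delays, and the choice of which exceptions count as retryable are assumptions for illustration.

```python
import random
import time


def retry_with_backoff(call, max_attempts=4, base_delay=0.1, max_delay=2.0,
                       retry_on=(TimeoutError, ConnectionError),
                       sleep=time.sleep, rng=random.random):
    """Call `call()` until it succeeds or attempts run out, sleeping a jittered delay between tries."""
    for attempt in range(max_attempts):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            cap = min(max_delay, base_delay * 2 ** attempt)  # 0.1, 0.2, 0.4, ... up to max_delay
            sleep(rng() * cap)  # full jitter: spread retries so clients do not retry in lockstep
```

Usage looks like `retry_with_backoff(lambda: fetch_from_cache(key))`, where `fetch_from_cache` is a hypothetical call that raises `TimeoutError` when it is slow. Only retry operations that are safe to repeat.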

Build this in week 16. Put your service behind a load balancer. Run two copies. Kill one mid-request and watch what happens. Add a rate limiter on POST /links. Add a timeout and a retry on the cache call.
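
For the rate limiter on POST /links, a token bucket is a reasonable first choice. Here is a minimal single-process sketch; the class name, the rate, and the capacity are assumptions, and a limiter shared across two service copies would need its state in a shared store rather than in memory.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, then a steady `rate_per_sec` of requests."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start full so the first burst is allowed
        self._last = clock()

    def allow(self, cost=1.0):
        now = self._clock()
        # Refill for the time that has passed, never beyond capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False  # the caller would typically answer with HTTP 429


# Per-user limiting: one bucket per user id (hypothetical ids).
buckets = {}


def allow_request(user_id, rate_per_sec=5.0, capacity=10):
    bucket = buckets.setdefault(user_id, TokenBucket(rate_per_sec, capacity))
    return bucket.allow()
```

The same class covers per-IP and per-endpoint limits. Only the dictionary key changes.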

You are done when you can take any system and answer “what happens if X dies?” for every box in the diagram.


Stage 5: Distributed systems hard parts

Goal. Understand the genuinely subtle problems. This is where senior separates from mid.

The picture in your head.

flowchart LR
    subgraph US["US region"]
        SU(["Service"]):::s
        DU[("DB")]:::db
    end
    subgraph EU["EU region"]
        SE(["Service"]):::s
        DE[("DB")]:::db
    end
    DU <-.->|"replication + conflicts"| DE
    SU <-->|"cross-region call<br/>~100ms"| SE

    classDef s fill:#dcfce7,stroke:#15803d,color:#14532d
    classDef db fill:#fed7aa,stroke:#c2410c,color:#7c2d12

Topics.

| Group | Topics |
| --- | --- |
| Clocks lie | Why machines disagree about time. Lamport timestamps. Vector clocks. Hybrid logical clocks. |
| Consensus | What consensus actually means. Paxos (the idea). Raft (the readable version). Why most teams use etcd or ZooKeeper. |
| Leader election | How a cluster picks a leader. Split-brain. The brief window without one. |
| Coordination | Distributed locks. Why most “use a distributed lock” answers are wrong. When they are right. |
| Transactions across services | Two-phase commit (and why nobody uses it). Saga pattern. Compensating actions. |
| Quorum | N, R, W. Why R + W > N gives strong consistency. The availability cost. |
| Strong models | Linearizability. Serializability. Why they are not the same. |
| Geo | Data residency (GDPR forces this). Active-active vs active-passive. Follow-the-sun. |
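
The quorum rule in the table is a counting argument, and you can check it by brute force. The sketch below tries every possible read set of size R against every possible write set of size W on N replicas and reports whether they always share at least one node. The function name is made up for the example.

```python
from itertools import combinations


def quorums_always_overlap(n, r, w):
    """True if every R-replica read set shares a node with every W-replica write set."""
    nodes = range(n)
    return all(
        set(read_set) & set(write_set)  # an empty intersection is falsy
        for read_set in combinations(nodes, r)
        for write_set in combinations(nodes, w)
    )


for n, r, w in [(3, 2, 2), (3, 1, 1), (5, 3, 3), (5, 2, 3)]:
    print(f"N={n} R={r} W={w}  R+W>N: {r + w > n}  always overlap: {quorums_always_overlap(n, r, w)}")
```

Overlap means any read touches at least one replica that saw the latest acknowledged write. That is the property the inequality buys; how far it gets you toward the consistency you need still depends on how the store handles concurrent writes and failures. The availability cost is visible in the same numbers: with N=3 and W=2, writes stop once two replicas are down.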

Build this in week 20. Set up Postgres replication with one primary and one replica. Force a failover. Time it. Read the Raft paper (the short one). Implement a tiny leader-election with three nodes using Redis (then realise why this is a bad idea, and remember that).

You are done when every system design answer you give ends with “and here is the trade-off I am accepting.”


Stage 6: Real architectures

Goal. Patterns that show up across many real products. Once you know the parts, the same shapes appear over and over.

Architectures to study.

| Architecture | What stresses the design | Where it lives in the real world |
| --- | --- | --- |
| News feed | Push vs pull fan-out. Celebrity problem. | Twitter, Instagram, LinkedIn. |
| Real-time chat | WebSockets. Presence. Ordering. Mobile reconnect. | WhatsApp, Slack, Discord. |
| Search | Inverted index. Ranking. Typo tolerance. | Google, Algolia, Elasticsearch. |
| Recommendations | Online serving (not training). Cold-start. | Spotify, Netflix, TikTok. |
| Video streaming | Transcoding ladder. Adaptive bitrate. CDN. | YouTube, Netflix, Twitch. |
| Ride sharing | Real-time location. Matching. State machine. | Uber, Lyft, Bolt. |
| Payments | Idempotency. Reconciliation. PCI scope. | Stripe, Adyen, every bank. |
| Notifications | Fan-out. Channel routing. Quiet hours. Retries. | Push notifications, email blasts. |
| Approval workflows | State machine. Role resolution. Audit. | Workday, ServiceNow, Jira. |

Cross-cutting patterns.

| Pattern | When it earns its complexity |
| --- | --- |
| API gateway | Always, once you have more than two services. |
| Service mesh (Istio, Linkerd) | Rarely. Only if you have 50+ services and a platform team. |
| CQRS | When read and write paths look completely different. |
| Event sourcing | When you need to replay history. Audit, finance, debugging. |
| Strangler fig | When you are replacing a legacy system. The only safe way to do it. |

Build this in week 24. Pick three products you use every day. Sketch their architecture before you research. Then research. Then compare. The gap between your guess and reality is the lesson.

You are done when you can look at any product and sketch the high-level architecture from memory.


Stage 7: Interview craft (running in parallel from week 1)

Goal. Win the interview, not just know the material.

The five moves of a system design interview.

flowchart LR
    A[1. Clarify<br/>~5 min]:::a --> B[2. Capacity math<br/>~5 min]:::b
    B --> C[3. Minimum architecture<br/>~10 min]:::c
    C --> D[4. Deepen one part<br/>~15 min]:::d
    D --> E[5. Follow-up scenarios<br/>~10 min]:::e

    classDef a fill:#dbeafe,stroke:#1e40af,color:#1e3a8a
    classDef b fill:#dcfce7,stroke:#15803d,color:#14532d
    classDef c fill:#fef3c7,stroke:#a16207,color:#713f12
    classDef d fill:#fed7aa,stroke:#c2410c,color:#7c2d12
    classDef e fill:#fecaca,stroke:#b91c1c,color:#7f1d1d

Topics.

| Group | Topics |
| --- | --- |
| Opening moves | The five clarifying questions that change every design. Naming the constraints out loud. |
| Capacity math | Requests per second. Storage per year. Bandwidth at peak. Practice until you can do it in 90 seconds. |
| Drawing | Rectangle = service. Cylinder = data store. Hexagon = queue. Label every arrow. |
| Decisions out loud | Options, trade-offs, your pick. Say all three. |
| Going deep | When asked to choose where to go deep, pick the part you know best, not just the part the interviewer last asked about. |
| Common traps | Over-engineering from minute one. Forgetting auth. Forgetting rate limiting. Forgetting GDPR. |
| Handling “I do not know” | “I am not certain, but my best guess is X because Y.” Better than confident bluffing. |
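
The capacity math row is three multiplications and one division. A sketch of the arithmetic follows; every input (10 million daily users, one write per user per day, 100 reads per write, 500 bytes per record, a 3x peak factor) is a hypothetical number picked for the example, not a figure from any real system.

```python
SECONDS_PER_DAY = 86_400


def capacity_estimate(daily_active_users, writes_per_user_per_day,
                      reads_per_write, bytes_per_write, peak_factor=3):
    """Back-of-envelope numbers: requests per second, storage per year, bandwidth at peak."""
    writes_per_day = daily_active_users * writes_per_user_per_day
    avg_write_rps = writes_per_day / SECONDS_PER_DAY
    avg_read_rps = avg_write_rps * reads_per_write
    peak_read_rps = avg_read_rps * peak_factor
    return {
        "avg_write_rps": avg_write_rps,
        "avg_read_rps": avg_read_rps,
        "peak_read_rps": peak_read_rps,
        "storage_per_year_gb": writes_per_day * bytes_per_write * 365 / 1e9,
        "peak_read_bandwidth_mb_s": peak_read_rps * bytes_per_write / 1e6,
    }


estimate = capacity_estimate(
    daily_active_users=10_000_000,
    writes_per_user_per_day=1,
    reads_per_write=100,
    bytes_per_write=500,
)
for name, value in estimate.items():
    print(f"{name}: {value:,.1f}")
```

In the interview you round hard: treat a day as roughly 100,000 seconds, so 10 million writes a day is about 100 writes per second. Say the rounding out loud.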

Practice every day. Pick one of the practice problems on this track. Read the question. Write your answer in a doc. Then read the solution. Compare. Repeat.

You are done when you can walk into any senior-level interview and finish with the interviewer convinced you have been doing this for years.


The full topic matrix

If you want a single page to scan and check off, this is it.

| Area | Stage 1 | Stage 2 | Stage 3 | Stage 4 | Stage 5 | Stage 6 |
| --- | --- | --- | --- | --- | --- | --- |
| Networking | HTTP, TCP, DNS, TLS | | | TLS termination | Cross-region latency | |
| Data | | SQL, NoSQL, indexes, ACID, replication, sharding | | | Quorum, consistency models | |
| Caching | | | Redis, CDN, eviction, hot keys | | | |
| Messaging | | | Kafka, SQS, outbox, CDC, dedup | | | |
| Scale | Latency vs throughput | | | LB, auto-scale, rate limit | | |
| Reliability | | | | Retries, circuit breakers, DR | Leader election, failover | |
| Distributed | | | | | Clocks, consensus, Raft, Saga | Geo, multi-region |
| Patterns | | | | | | API gateway, CQRS, ES |
| Architectures | | | | | | Feed, chat, search, video |
| Interview | Clarifying questions | Math | Drawing | Going deep | Follow-ups | Trade-offs |

Every cell that is empty is intentional. Those topics belong to a later stage. Do not jump.


The 6-month plan, week by week

gantt
    title 6-month learning plan
    dateFormat YYYY-MM-DD
    axisFormat %b

    section Foundations
    HTTP, TCP, DNS, TLS                :a1, 2026-01-01, 14d
    Latency numbers, async basics      :a2, after a1, 14d

    section Storage
    SQL, indexes, EXPLAIN              :b1, after a2, 14d
    NoSQL, replication, sharding       :b2, after b1, 14d

    section Caching and queues
    Redis, hot keys, stampede          :c1, after b2, 14d
    Kafka, outbox, idempotency         :c2, after c1, 14d

    section Scaling and reliability
    Load balancers, rate limits        :d1, after c2, 14d
    Retries, circuit breakers, DR      :d2, after d1, 14d

    section Distributed systems
    Clocks, consensus, Raft            :e1, after d2, 14d
    Saga, geo, quorum                  :e2, after e1, 14d

    section Real architectures
    Feed, chat, search                 :f1, after e2, 14d
    Video, ride share, payments        :f2, after f1, 14d

    section Interview craft
    One problem per day                :g1, 2026-01-01, 180d

Same plan as a table.

| Month | Stage | Focus | Build by the end |
| --- | --- | --- | --- |
| 1 | Foundations | HTTP, TCP, DNS, TLS, latency numbers, API styles. | A tiny HTTP server with two endpoints, deployed to a free tier. |
| 2 | Storage | SQL, indexes, transactions, replication, sharding. | Move your data to Postgres. Read an EXPLAIN plan. |
| 3 | Caching and async | Redis, eviction, hot keys, Kafka, idempotency. | Add Redis and Kafka. Measure cache hit rate. |
| 4 | Scaling and reliability | Load balancers, retries, circuit breakers, DR. | Run two copies behind a LB. Kill one and watch. |
| 5 | Distributed systems | Consensus, leader election, Saga, geo. | Force a Postgres failover. Time it. |
| 6 | Real architectures + interview practice | Feed, chat, search, video, payments. | One practice problem per day. Compare to the solution. |

Block one hour every morning. That is enough. Two hours is better. Three hours and you burn out.


A short note before you start

Nobody learns system design by reading. You learn it by drawing, building, breaking, and fixing. The roadmap is the map of the territory, not the territory.

The territory is everything that lives in production. Go build something small. Break it on purpose. Then fix it.

When you come back here, the practice problems are the exam. Use them.