Design YouTube / Netflix (Video Streaming)
What we are building
A video streaming platform takes a raw video file from a creator and delivers it to viewers around the world, in the right quality for their connection, as fast as possible.
Concretely: Alice records a 4K cooking video on her phone and uploads a 5 GB file. The platform transcodes it into seven resolutions (240p through 4K), cuts each into 6-second segments, and stores them in object storage. Bob is on a bus with a spotty LTE connection. He presses Play, the player starts at 360p, notices bandwidth improving, and silently switches to 720p mid-stream. Neither Bob nor Alice had to think about any of this.
That is the whole product. That is YouTube.
The problem looks like one system. It is actually two systems that share a database.
Four hard problems hide inside it:
- The upload pipeline vs the playback path. They are separate systems with opposite requirements. Designing them as one is the most common mistake.
- Transcoding cost and codec choice. H.264 encodes at 100x real-time. AV1 encodes at 0.07x real-time. Choosing the wrong codec ladder can multiply compute cost by 20x.
- CDN strategy and cache hit rate. 250 Tbps of peak egress is physically impossible to serve from a data center. A three-tier CDN with regional shields is the only architecture that makes the numbers work.
- Adaptive bitrate switching. Video segments across all resolutions must cut at identical timestamps, or a quality switch mid-stream causes a visible glitch.
We will start with the smallest thing that works, then add one layer at a time as each problem appears.
The lifecycle of one video
Every video moves through a short set of states from upload to playback.
stateDiagram-v2
direction LR
[*] --> Uploading: creator presses Upload
Uploading --> Transcoding: all chunks received
Transcoding --> Ready: all quality variants done
Ready --> Streaming: viewer presses Play
Streaming --> [*]: playback ends
Ready --> Blocked: takedown or moderation
Blocked --> [*]
A video spends most of its life in Ready, getting watched. The upload and transcoding states are temporary. The Blocked transition exists for takedowns and copyright enforcement, both of which must propagate globally in under 60 seconds.
Take this with you. Video streaming is a pipeline, not a service. Upload and playback are two different systems. Treat them that way from the first sentence.
How big this gets
A YouTube-shaped platform gives us these numbers to work with.
| Input | Number |
|---|---|
| Uploads | 500 hours of video per minute |
| Concurrent viewers, peak | 125 million |
| Peak egress | 250 Tbps |
| Storage growth | 70 PB per year |
| Upload size limit | 256 GB per file |
From these we can derive everything else.
Show: the derived numbers
| Metric | Value | How |
|---|---|---|
| Upload ingest, steady | ~5 Gbps | 500 hr/min × 10 Mbps avg bitrate / 8 |
| Upload ingest, peak | ~15 Gbps | 3x steady |
| New videos per day | ~450,000 | 500 hr/min × 60 × 24 / 4 min avg |
| New videos per second | ~5 | 450K / 86,400 |
| Average concurrent viewers | ~42 million | 1B watch-hours/day × 3600 / 86400 |
| Peak egress | 250 Tbps | 125M × 2 Mbps avg |
| Source storage, annual | ~20 PB | 500 sec/sec × 86,400 × 365 × 10 Mbps / 8 |
| Transcoded storage, annual | ~50 PB | ~2.5x source (7-9 quality variants) |
| Transcoding cores needed | ~3,500 | H.264 at 100x real-time, 5 sec/sec ingest |
Three numbers dominate the whole design:
| Number | Size | Why it matters |
|---|---|---|
| Peak egress | 250 Tbps | Requires a global CDN with hundreds of PoPs |
| Transcoding compute | ~3,500 CPU cores at H.264 | Running 24/7 just to keep up |
| Storage | 70 PB/year, forever | Tiering by access frequency saves ~5x in cost |
Everything in the architecture exists to keep one of these three numbers manageable.
Take this with you. Two costs dominate: CDN egress and transcoding compute. Storage is third. Everything else is rounding error.
The smallest version that works
Forget YouTube. We are a tiny platform with 100 creators and 10,000 viewers.
Three boxes. One upload flow.
flowchart TB
C([Creator]):::user --> U["Upload Service<br/>(receive file, enqueue)"]:::app
U --> S3src[("S3 source files")]:::db
S3src --> TW["Transcoding Worker<br/>(ffmpeg)"]:::app
TW --> S3out[("S3 segments")]:::db
S3out -.CDN.-> V([Viewer]):::user
classDef user fill:#dbeafe,stroke:#1e40af,color:#1e3a8a
classDef app fill:#dcfce7,stroke:#15803d,color:#14532d
classDef db fill:#fed7aa,stroke:#c2410c,color:#7c2d12
Two endpoints carry the core product.
| Endpoint | What it does |
|---|---|
POST /uploads | Accept a file, return video_id and status=transcoding |
GET /videos/:id | Return metadata + signed CDN URL to master.m3u8 |
Show: the two key tables
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
CREATE TABLE videos (
video_id VARCHAR(16) PRIMARY KEY,
owner_id BIGINT NOT NULL,
title VARCHAR(200),
status TEXT NOT NULL, -- 'uploading', 'transcoding', 'ready', 'blocked'
source_path TEXT,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE TABLE variants (
video_id VARCHAR(16),
quality VARCHAR(20), -- '360p_h264', '1080p_h265'
status TEXT NOT NULL, -- 'pending', 'encoding', 'ready', 'failed'
output_path TEXT,
bitrate_kbps INT,
completed_at TIMESTAMPTZ,
PRIMARY KEY (video_id, quality)
);
variants has one row per (video, quality) so each transcoding job updates its own row independently. No locking on a shared JSON blob.
This is enough for a hundred creators on a Tuesday. The interesting question is what breaks first as the platform grows. Three things will: how uploads survive poor connections, how transcoding keeps up with demand, and how video bytes reach 125 million people at once. We address each in turn.
Decision 1: how do we handle large uploads reliably?
A creator on a phone uploads a 2 GB file. Their Wi-Fi drops at 1.9 GB. Without a resumable upload protocol, they start over. This is not acceptable.
The fix is to split the file into chunks and track progress server-side. The client calls HEAD /uploads/<id> after reconnecting to ask “where did we stop?” and resumes from that byte offset.
flowchart LR
subgraph Upload["Resumable upload flow"]
A["POST /uploads<br/>get upload_id"] --> B["PATCH chunk 1<br/>offset 0, 16 MB"]
B --> C["PATCH chunk 2<br/>offset 16 MB"]
C --> D["..."]
D --> E["POST /finalize<br/>verify sha256"]
end
subgraph Resume["On reconnect"]
F["HEAD /uploads/<id><br/>get current offset"] --> G["PATCH chunk N<br/>resume from offset"]
end
This is the TUS protocol (or S3 multipart, which uses the same idea). The only cost is that the Upload Service must track chunk offsets in its database. Each chunk is 16 MB. A 5 GB upload is 320 chunks. A single failed chunk retries, not the whole file.
There is one subtle problem at finalize: the client may call it twice if their connection drops during the response. The finalize must be idempotent. If the multipart upload is already complete, return the existing video_id without creating a duplicate.
Take this with you. Chunked resumable uploads are not an optimization. They are what makes a mobile upload platform function at all. Without them, anyone on a phone uploading over 100 MB will sometimes fail.
Decision 2: how do we transcode at scale?
A source video arrives. It needs to become seven or more quality variants, each cut into 6-second segments. That is compute-heavy work that takes minutes per video. The platform needs to do it for thousands of videos per day without blocking new uploads.
The answer is a durable job queue and a pool of workers.
flowchart TB
U["Upload Service"]:::app --> K{{"Kafka<br/>transcode.requested<br/>× 7 jobs"}}:::queue
K --> TW1["Worker (H.264 pool)"]:::app
K --> TW2["Worker (H.265 pool)"]:::app
K --> TW3["Worker (AV1 pool)"]:::app
TW1 --> S3out[("S3 segments")]:::db
TW2 --> S3out
TW3 --> S3out
TW1 --> K2{{"Kafka<br/>transcode.completed"}}:::queue
TW2 --> K2
TW3 --> K2
K2 --> MB["Manifest Builder"]:::app
MB --> S3out
classDef app fill:#dcfce7,stroke:#15803d,color:#14532d
classDef db fill:#fed7aa,stroke:#c2410c,color:#7c2d12
classDef queue fill:#ddd6fe,stroke:#6d28d9,color:#4c1d95
A few rules make this work in production:
Codec selection changes everything. The codec determines how much compute each job needs.
| Codec | Encode speed | File size vs H.264 | Decode support | Use case |
|---|---|---|---|---|
| H.264 | ~100x real-time | baseline | Universal | Always ship this |
| H.265 | ~33x real-time | 30% smaller | Near-universal | Worth it above early stage |
| AV1 | ~0.07x real-time | 50% smaller | Modern devices | Top 1-5% by watch time only |
Encoding one 4-minute video in AV1 takes about 20 minutes on a single CPU core. At H.264 speeds, the same video takes 24 seconds. Using AV1 for everything would require 15-30x more compute for the same ingest rate.
Segment alignment across qualities is not optional. Every quality must cut at exactly the same timestamps (0s, 6s, 12s…). The player switches quality between segments. If 360p cuts at different boundaries than 1080p, the quality switch causes a visible frame jump. Use -force_key_frames "expr:gte(t,n_forced*6)" in ffmpeg.
Idempotent output. Workers write to the same S3 keys every run. If a worker crashes mid-job and Kafka redelivers the message, the second run overwrites the same keys cleanly. This makes at-least-once Kafka delivery safe.
Janitor for stuck jobs. A periodic scan finds variants rows with status=encoding older than 2x the expected job duration and republishes them to Kafka. Workers are stateless; restarting is always safe.
Take this with you. The queue makes the pipeline reliable. The codec ladder determines the compute bill. Segment alignment at the same timestamps across all qualities is what makes adaptive bitrate switching work without visual glitches.
Decision 3: how does the CDN serve 250 Tbps?
No single data center can send 250 Tbps of video. At 125 million concurrent viewers averaging 2 Mbps, the math is simple: we need roughly 200+ points of presence around the world, each caching the most popular content for viewers in that city.
But a flat CDN has a fatal flaw: if 200 edge nodes all miss the same segment, they each fire a separate request to S3 origin. One segment would cause 200 S3 requests. A regional shield fixes this.
flowchart TB
V(["Viewers in Tokyo, São Paulo, Lagos"]):::user
subgraph Tier1["Tier 1: Edge PoPs (200+ worldwide)"]
E1["Tokyo PoP<br/>~95% hit"]:::edge
E2["São Paulo PoP<br/>~95% hit"]:::edge
E3["Lagos PoP<br/>~95% hit"]:::edge
end
subgraph Tier2["Tier 2: Regional Shields (10-20 worldwide)"]
SH1["Asia Shield<br/>~80% hit on edge misses"]:::edge
SH2["Americas Shield<br/>~80% hit on edge misses"]:::edge
end
subgraph Tier3["Tier 3: Origin"]
S3out[("S3 segments")]:::db
end
V --> E1
V --> E2
V --> E3
E1 -.miss.-> SH1
E2 -.miss.-> SH2
E3 -.miss.-> SH1
SH1 -.miss.-> S3out
SH2 -.miss.-> S3out
classDef user fill:#dbeafe,stroke:#1e40af,color:#1e3a8a
classDef edge fill:#e2e8f0,stroke:#475569,color:#1e293b
classDef db fill:#fed7aa,stroke:#c2410c,color:#7c2d12
The arithmetic: at 95% edge hit rate, only 5% of requests reach the shield. At 80% shield hit rate, only 20% of those reach S3. S3 sees roughly 1% of total traffic. At 250 Tbps total, that means S3 handles about 2.5 Tbps. Without the CDN, S3 would need to serve all 250 Tbps.
Cache TTL choices follow the content type.
| Content | TTL | Reason |
|---|---|---|
Video segments (seg_N.m4s) | 7 days | Immutable once written |
Master playlist (master.m3u8) | 60 seconds | New qualities become available as transcoding completes |
| Thumbnails | 1 year | Immutable once created |
| Signed URLs | 4-8 hours | Expire before they can be shared for free |
Take this with you. “Use a CDN” is not a design. The regional shield is the critical piece. Without it, S3 cannot absorb the request volume on cache misses.
Decision 4: how does the player pick the right quality?
The player, not the server, decides which quality to request next. The server never pushes video. The player pulls one segment at a time and adjusts quality between segments based on measured bandwidth and buffer depth. This is ABR (adaptive bitrate streaming).
stateDiagram-v2
direction LR
[*] --> Starting: press Play
Starting --> Buffering: fetch master.m3u8, pick 360p
Buffering --> Playing: first segment decoded
Playing --> Playing: fetch next segment, measure speed
Playing --> SwitchUp: buffer > 30s AND bandwidth allows higher quality
Playing --> SwitchDown: buffer < 5s, refill fast
SwitchUp --> Playing: next request at higher quality
SwitchDown --> Playing: next request at lower quality
Playing --> [*]: video ends
The master playlist tells the player what resolutions and bitrates are available:
1
2
3
4
5
6
7
#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=700000,RESOLUTION=640x360
360p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=2500000,RESOLUTION=1280x720
720p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=5000000,RESOLUTION=1920x1080
1080p/playlist.m3u8
HLS and DASH do the same job with different file formats. Modern platforms write segments in .m4s (the CMAF format) and generate both an HLS .m3u8 and a DASH .mpd manifest pointing at the same segment files. One set of segments, two manifests, all devices covered.
Take this with you. The player never asks your servers for video bytes. It asks the CDN. Your Watch Page API only produces the signed URL and metadata. After that, your code is out of the loop.
Decision 5: how do we manage 70 PB/year of storage?
Most videos stop being watched after the first week. Storing them all in S3 Standard costs $0.023/GB/month. At 70 PB/year and 10 years of data, that is hundreds of millions of dollars. Storage tiering by access frequency reduces the blended cost by 4-5x.
flowchart LR
New["New segments<br/>uploaded today"]:::app -->|"0-7 days"| Hot["S3 Standard<br/>$0.023/GB/mo"]:::db
Hot -->|"7-90 days<br/>(no recent views)"| Warm["S3 Infrequent Access<br/>$0.0125/GB/mo"]:::db
Warm -->|"90+ days<br/>(no recent views)"| Cold["S3 Glacier Instant<br/>$0.004/GB/mo"]:::db
Cold -->|"1+ year<br/>(no views at all)"| Archive["S3 Glacier Deep Archive<br/>$0.00099/GB/mo"]:::db
classDef app fill:#dcfce7,stroke:#15803d,color:#14532d
classDef db fill:#fed7aa,stroke:#c2410c,color:#7c2d12
A daily tiering job reads last_viewed_at from the analytics store and emits S3 lifecycle transitions in bulk. When a cold video suddenly goes viral, it is promoted back to Standard within minutes.
Source files are never deleted. A new codec ships every few years. You need the original source to re-encode without quality loss when that happens.
With tiering at roughly 5% hot / 15% warm / 80% cold, the blended cost drops from $0.023/GB to about $0.005/GB.
Take this with you. Source files are kept forever. Transcoded variants are tiered aggressively. That combination preserves quality for future re-encoding while keeping storage costs manageable.
The full architecture
Pulling the five decisions together gives us the system.
flowchart TB
subgraph Edge["Client edge"]
C(["Creator / Viewer"]):::user
GW["API Gateway<br/>(auth · rate limit · routing)"]:::edge
end
subgraph UploadPipeline["Upload pipeline"]
U["Upload Service<br/>(TUS / S3 multipart)"]:::app
S3src[("S3 source<br/>hot→IA at 30d→Glacier at 90d")]:::db
K{{"Kafka<br/>transcode.requested<br/>transcode.completed<br/>video.published"}}:::queue
TW["Transcoding Farm<br/>H.264 · H.265 · AV1 pools"]:::app
MB["Manifest Builder"]:::app
DB[("Postgres<br/>videos · variants · manifests")]:::db
end
subgraph PlaybackPath["Playback path"]
WA["Watch Page API<br/>(metadata + sign URLs)"]:::app
RD[("Redis<br/>hot metadata cache")]:::cache
end
S3out[("S3 segments<br/>hot/warm/cold tiered")]:::db
subgraph CDNStack["CDN (3-tier)"]
EP["Edge PoPs<br/>(200+ worldwide, ~95% hit)"]:::edge
SH["Regional Shields<br/>(10-20 worldwide, ~80% hit on misses)"]:::edge
end
subgraph Telemetry["Telemetry pipeline"]
TK{{"Kafka<br/>playback events"}}:::queue
FL["Flink<br/>(aggregate, 1-min windows)"]:::app
VDB[("Redis + ClickHouse<br/>view counts · analytics")]:::db
end
C --> GW
GW -->|upload| U
GW -->|watch| WA
GW -->|telemetry| TK
U --> S3src
U --> DB
U --> K
K --> TW
TW --> S3out
TW --> K
K --> MB
MB --> S3out
MB --> DB
WA --> DB
WA --> RD
WA -->|"signed URL"| EP
EP -.miss.-> SH
SH -.miss.-> S3out
TK --> FL
FL --> VDB
classDef user fill:#dbeafe,stroke:#1e40af,color:#1e3a8a
classDef edge fill:#e2e8f0,stroke:#475569,color:#1e293b
classDef app fill:#dcfce7,stroke:#15803d,color:#14532d
classDef db fill:#fed7aa,stroke:#c2410c,color:#7c2d12
classDef cache fill:#fecaca,stroke:#b91c1c,color:#7f1d1d
classDef queue fill:#ddd6fe,stroke:#6d28d9,color:#4c1d95
Each component, in one sentence:
| Component | Purpose |
|---|---|
| API Gateway | Auth, rate limiting, routes upload vs playback vs telemetry traffic |
| Upload Service | Accepts chunked uploads, issues TUS tokens, triggers transcoding |
| Kafka | Durable queue between pipeline stages. Worker crashes do not lose jobs |
| Transcoding Farm | Pools of ffmpeg workers, one pool per codec. Autoscales on Kafka lag |
| Manifest Builder | Assembles master.m3u8 as quality variants complete |
| Postgres | Video catalog, per-variant job status |
| Watch Page API | Reads metadata, signs CDN URLs, returns JSON. The only code on the playback hot path |
| CDN (Edge + Shield + Origin) | Serves all video bytes. Three tiers absorb 99% of traffic before S3 |
| Redis | Hot metadata cache. Cuts Postgres reads by 100x for popular videos |
| Flink + ClickHouse | Stream-aggregates view counts and watch-time data for creator dashboards |
Notice what is not on the synchronous playback path: analytics, view counts, transcoding. If Flink has a lag spike at 3 a.m., viewers keep watching. View counts just update a bit late.
Walk: an upload, end to end
Alice records a 5 GB cooking video and presses Upload.
sequenceDiagram
autonumber
participant Alice
participant GW as API Gateway
participant U as Upload Service
participant S3src as S3 (source)
participant K as Kafka
participant W as Transcoding Worker
participant S3out as S3 (segments)
participant MB as Manifest Builder
participant DB as Postgres
Alice->>GW: POST /uploads (filename, size, sha256)
GW->>U: forward (auth ok)
rect rgb(241, 245, 249)
Note over U,DB: one transaction
U->>DB: INSERT video (status=uploading)
U->>S3src: initiate S3 multipart upload
U-->>Alice: video_id=v_8h3jK2p, chunk_size=16 MB
end
loop 320 chunks × 16 MB
Alice->>U: PATCH chunk N (offset=N×16 MB)
U->>S3src: write part N
U-->>Alice: 204, new offset
end
Note over Alice,U: Wi-Fi drops. Alice calls HEAD /uploads/v_8h3jK2p, gets offset, resumes
Alice->>U: POST /finalize
U->>S3src: complete multipart (assemble 5 GB)
U->>DB: UPDATE video (status=transcoding)
U->>K: emit transcode.requested × 7 (one per quality)
U-->>Alice: 202 { status: "transcoding" }
par 7 quality jobs run in parallel
W->>K: consume transcode.requested
W->>S3src: download source (~5 GB)
W->>W: ffmpeg: 6s segments, aligned keyframes
W->>S3out: write segments + per-quality playlist
W->>K: emit transcode.completed { quality }
end
MB->>K: consume transcode.completed (waits for all 7)
MB->>S3out: write master.m3u8
MB->>DB: UPDATE video (status=ready)
Three things worth pointing at:
- The
INSERT videoandinitiate multiparthappen in the same transaction. If the server crashes mid-way, the video row and the S3 upload either both exist or neither does. - Workers produce segments at the same cut points across all qualities. That alignment is what lets the player switch quality mid-stream without a visual glitch.
- The Manifest Builder publishes a partial
master.m3u8after the first quality completes, so Alice can watch a 360p version while 1080p is still encoding. The master playlist has a 60-second CDN TTL, so viewers see new qualities appear quickly.
Walk: a playback request, end to end
Bob opens the watch page on his commute.
sequenceDiagram
autonumber
participant Bob
participant GW as API Gateway
participant WA as Watch Page API
participant RD as Redis
participant DB as Postgres
participant CDN as CDN Edge PoP
participant S3out as S3 (origin)
Bob->>GW: GET /videos/v_8h3jK2p
GW->>WA: forward (~5ms)
WA->>RD: GET video:v_8h3jK2p
alt Redis cache hit (~90% for popular videos)
RD-->>WA: metadata (~1ms)
else cache miss
WA->>DB: SELECT video, manifests (~10ms)
DB-->>WA: rows
WA->>RD: SET video:v_8h3jK2p TTL=5min
end
WA-->>Bob: title, signed master.m3u8 URL (~50ms total)
Bob->>CDN: GET master.m3u8 (signed URL)
alt edge cache hit (~95%)
CDN-->>Bob: 1 KB playlist (~50ms)
else miss → shield → S3
CDN->>S3out: GET master.m3u8
S3out-->>CDN: file
CDN-->>Bob: 1 KB playlist
end
Note over Bob: player picks 360p to start safe
Bob->>CDN: GET 360p/seg_0.m4s
CDN-->>Bob: ~1 MB segment (6 sec of video, ~200ms)
Note over Bob: starts playing, measures 40 Mbps available
Bob->>CDN: GET 720p/seg_1.m4s
CDN-->>Bob: ~1.9 MB segment (~250ms)
Note over Bob,CDN: player keeps fetching segments, adjusting quality as bandwidth changes
Latency budget for first frame:
| Step | Typical time |
|---|---|
| API call (Redis hit) | ~50ms |
GET master.m3u8 (CDN edge hit) | ~50ms |
GET 360p/playlist.m3u8 (CDN edge hit) | ~50ms |
GET seg_0.m4s (CDN edge hit) | ~200ms |
| First frame decode | ~50ms |
| Total | ~400ms |
A 2-second start-time SLO is achievable even with cold-cache misses.
The hard sub-problem: deleting a video fast
A creator deletes a video. Or trust-and-safety issues a takedown. The video is cached in hundreds of edge PoPs worldwide. How do you stop playback globally within 60 seconds?
The naive approach is to set TTLs short enough that caches expire quickly. The problem is that short TTLs destroy the CDN’s effectiveness for the normal case. A 60-second TTL on segments means 99%+ of traffic hits the shield or S3 origin.
The real answer uses CDN purge + signing key revocation together.
flowchart LR
TS["Trust & Safety<br/>API call: block video"]:::app --> DB[("Postgres<br/>status=blocked")]:::db
DB -->|CDC event| K{{"Kafka<br/>video.blocked"}}:::queue
K --> WA["Watch Page API<br/>refuses to sign URLs"]:::app
K --> CDN_API["CDN Purge API<br/>/v_8h3jK2p/*"]:::edge
K --> KeyRevoke["Signing Key Revocation<br/>(if DRM enabled)"]:::app
classDef app fill:#dcfce7,stroke:#15803d,color:#14532d
classDef db fill:#fed7aa,stroke:#c2410c,color:#7c2d12
classDef queue fill:#ddd6fe,stroke:#6d28d9,color:#4c1d95
classDef edge fill:#e2e8f0,stroke:#475569,color:#1e293b
The sequence:
- Status flips to
blockedin Postgres (milliseconds). - CDC event reaches Kafka (< 1 second).
- Watch Page API stops issuing signed URLs for this video (< 5 seconds).
- CDN purge propagates globally on major CDNs (Cloudflare, Fastly, Akamai) in 5-30 seconds.
- Viewers mid-playback drain their 30-second buffer, then see an error.
New viewers cannot start the video within 5-30 seconds. Viewers already watching see it stop within 30 seconds. The 60-second target is reachable.
Note: the S3 objects are not deleted. They are moved to a restricted bucket only the legal team can access. Takedowns often need to be reversed on appeal.
Take this with you. Deletion is not a single database write. It is three steps: refuse new signed URLs, purge CDN caches, revoke signing keys. Kafka carries the event so each layer can react independently.
Follow-up questions
Try answering each in 2 or 3 sentences before opening the solution.
Delete a viral video. A creator deletes a video with 100M views. It is cached in hundreds of edge PoPs. How do you stop playback globally within 60 seconds?
AV1 backfill. You want to re-encode the top 10,000 videos in AV1 to reduce bandwidth costs. How do you pick which videos to prioritize, and how do you run the job without disrupting live uploads?
Live streaming. A creator wants to broadcast a concert to 5 million concurrent viewers with under 3 seconds of delay. What changes in the architecture? What stays the same?
Thumbnails at scale. Every video needs 1-3 main thumbnails and auto-generated frames for the seek bar. How do you generate, store, and serve them? (There are about 120 seek frames per 4-minute video.)
Copyright takedown. A valid takedown notice arrives. You must block playback globally within 5 minutes. You cannot delete the source file. How do you do it?
Watch-time analytics. A creator wants to see “60% of viewers dropped off at the 3:47 mark.” Where does that data come from, and how do you compute it across billions of viewer sessions?
DRM (Netflix mode). The product switches to a subscription model. Every segment must be encrypted. Every device must get a decryption key before playback. What does the key flow look like?
Multiple audio tracks. A video has English, Spanish, and Hindi audio, plus 12 subtitle languages. How do these fit into an HLS master playlist? How does the player know which audio to download?
Regional shield outage. Your Asia regional shield goes down. What is the blast radius? What happens to the 200 edge PoPs behind it?
Real-time view counts. The recommendation team needs view counts with under 5 seconds of freshness for ranking signals. Your current Flink pipeline has 30 minutes of lag. What do you change?
Related problems
- URL Shortener (001). Introduces CDN caching, TTL trade-offs, and hot-key problems at a smaller scale.
- Notification System (010). The “your video is ready” and “new video from someone you follow” notifications run through it.
- News Feed (002). The watch page entry is one row in a feed. They share the metadata store and the follow graph.
- Distributed Cache (009). The manifest cache and hot-metadata cache use the same principles.
Try the problem on your own first. Solutions are most valuable after you've struggled with it.
Solution: Design YouTube / Netflix (Video Streaming)
The short version
Video streaming is two systems that share a database.
The upload pipeline is slow, batch, and compute-heavy. It takes a raw file, converts it to 7-9 quality variants, cuts each into 6-second segments, and writes a playlist. Done once per video.
The playback path is fast, read-heavy, and CDN-dominated. It delivers bytes to millions of people at once. The only code you own on the playback hot path is the Watch Page API, which signs a CDN URL and returns metadata. Every actual byte of video comes from the CDN.
Both systems share a metadata database (titles, owners, status) and object storage (the video files). Everything else is separate.
Three choices dominate everything:
- Keep upload bytes off your servers. TUS chunked uploads go directly to S3 multipart. 15 Gbps of ingest is S3’s problem.
- Codec choice is the compute budget. H.264 encodes at 100x real-time. AV1 encodes at 0.07x real-time. Use AV1 only for the top 1-5% of videos by watch time.
- The CDN shield is not optional. Edge PoP → regional shield → S3 origin. A 95% edge hit rate means S3 sees roughly 1% of all traffic. That is the only way the numbers work.
Numbers to know: 500 hours uploaded per minute, 125 million concurrent viewers at peak, 250 Tbps egress, 70 PB of new storage per year.
1. The two questions that matter most
YouTube or Netflix? The whole design forks here. YouTube is a long-tail UGC platform: billions of videos, 80% of them cold, aggressive tiering required. Netflix is a curated library: roughly 15,000 titles, nearly every video is hot, tiering matters less. The codec and cost story is completely different.
Live or VOD? Live needs real-time transcoding, RTMP/WebRTC ingest, and a LL-HLS or LL-DASH delivery stack with 1-2 second segments. VOD prepares files once and serves them forever. They share almost no infrastructure. Call your scope early.
Everything else (codec ladder, DRM, storage tiering, analytics freshness) follows from those two answers.
2. The math, in plain numbers
| Number | Value | What it tells you |
|---|---|---|
| Upload ingest | 5 Gbps sustained, 15 Gbps peak | Chunked upload direct to S3, not to your servers |
| New videos per day | ~450,000 (5/sec sustained) | Kafka queue depth for transcoding jobs |
| Concurrent viewers | 42M average, 125M peak | CDN sizing |
| Peak egress | 250 Tbps | The reason a three-tier CDN exists |
| Source storage | 20 PB/year | Cheap if tiered; kept forever for re-encoding |
| Transcoded storage | 50 PB/year | About 2.5x the source (7-9 quality variants) |
| Transcoding compute | ~3,500 CPU cores (H.264 only) | Add several thousand more for AV1 |
Two costs dominate: CDN egress and transcoding compute. Storage is third. Everything else is rounding error.
3. The API
Two flows: upload (creator side) and playback (viewer side).
Start an upload:
1
2
3
4
5
6
7
8
9
10
POST /api/v1/uploads
Idempotency-Key: <uuid>
{
"filename": "cooking-video.mp4",
"size_bytes": 5368709120,
"sha256": "abc123...",
"title": "Spaghetti carbonara",
"visibility": "public"
}
Response includes video_id, the upload endpoint URL, and chunk_size (16 MB).
Send a chunk (TUS style):
1
2
3
4
5
PATCH /u/<upload_id>
Upload-Offset: 33554432
Content-Length: 16777216
Content-Type: application/offset+octet-stream
<binary chunk>
Returns 204 No Content with the new Upload-Offset. To resume after a drop, HEAD /u/<upload_id> returns the current offset.
Finalize:
1
POST /api/v1/uploads/<upload_id>/finalize
Returns { "video_id": "v_8h3jK2p", "status": "transcoding" }.
Get video info:
1
2
3
4
5
6
7
8
9
10
11
GET /api/v1/videos/v_8h3jK2p
{
"status": "ready",
"manifests": {
"hls": "https://cdn.example.com/v_8h3jK2p/master.m3u8?sig=...",
"dash": "https://cdn.example.com/v_8h3jK2p/manifest.mpd?sig=..."
},
"available_qualities": ["360p", "480p", "720p", "1080p"],
"duration_seconds": 248
}
Manifests and segments are served by the CDN, not by your API. Your API produces the signed URL. The CDN serves every byte.
Playback telemetry (batched, fire and forget):
1
2
3
4
5
6
7
8
9
POST /api/v1/telemetry/playback
{
"video_id": "v_8h3jK2p",
"events": [
{ "ts": 1716364800000, "type": "start", "quality": "720p" },
{ "ts": 1716364812000, "type": "rebuffer", "duration_ms": 850 },
{ "ts": 1716365100000, "type": "complete", "watched_seconds": 300 }
]
}
Three choices worth defending:
| Choice | Reason |
|---|---|
Idempotency-Key on create | A mobile client retries on timeout. Without the key, retries create duplicate video rows and double-charged storage budgets. |
| Signed URLs on manifests | Short-lived (4-8 hours). Stops third-party sites from embedding your video for free. |
| Telemetry off the critical path | Players batch events and POST every few seconds. A lost batch means approximate analytics, not a broken product. |
4. The data model
Four tables. Postgres works fine until you reach billions of videos; then Cassandra with video_id as the partition key.
erDiagram
videos ||--o{ variants : has
videos ||--|| manifests : "one when ready"
videos ||--o{ view_counts : tracked
videos {
varchar video_id
bigint owner_id
text status
text source_path
timestamptz created_at
}
variants {
varchar video_id
varchar quality
text status
int bitrate_kbps
text output_path
timestamptz completed_at
}
manifests {
varchar video_id
text hls_master_url
text dash_mpd_url
text[] available_qualities
}
view_counts {
varchar video_id
bigint total_views
bigint total_watch_seconds
timestamptz last_viewed_at
}
Show: the full SQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
CREATE TABLE videos (
video_id VARCHAR(16) PRIMARY KEY,
owner_id BIGINT NOT NULL,
title VARCHAR(200),
visibility SMALLINT NOT NULL DEFAULT 1, -- 1=public 2=unlisted 3=private
status TEXT NOT NULL DEFAULT 'uploading',
source_path TEXT,
duration_ms BIGINT,
sha256 BYTEA,
idempotency_key TEXT,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
published_at TIMESTAMPTZ
);
CREATE INDEX idx_videos_owner ON videos (owner_id, created_at DESC);
CREATE TABLE variants (
video_id VARCHAR(16),
quality VARCHAR(20), -- '360p_h264', '1080p_h265', '720p_av1'
status TEXT NOT NULL DEFAULT 'pending',
bitrate_kbps INT,
codec VARCHAR(16),
segment_count INT,
output_prefix TEXT,
started_at TIMESTAMPTZ,
completed_at TIMESTAMPTZ,
PRIMARY KEY (video_id, quality)
);
CREATE TABLE manifests (
video_id VARCHAR(16) PRIMARY KEY,
hls_master_url TEXT,
dash_mpd_url TEXT,
available_qualities TEXT[],
updated_at TIMESTAMPTZ
);
CREATE TABLE view_counts (
video_id VARCHAR(16) PRIMARY KEY,
total_views BIGINT NOT NULL DEFAULT 0,
total_watch_seconds BIGINT NOT NULL DEFAULT 0,
last_viewed_at TIMESTAMPTZ,
last_updated TIMESTAMPTZ
);
Three choices worth defending:
variants has one row per (video, quality). Each transcoding job updates its own row independently as it completes. No compare-and-swap on a shared JSON blob, no lock contention between workers.
view_counts is a separate table. Updated by the analytics pipeline at its own cadence. The write path for views is separate from the write path for catalog edits. Mixing them would force every view event through the catalog’s write path.
manifests is separate from videos. The Manifest Builder writes it when enough qualities complete. This lets you show a partial playlist (360p available, 1080p still encoding) without touching the main videos row.
Why Postgres first, Cassandra later: at startup, ACID transactions matter (update variant status and publish Kafka message atomically). At 10 billion videos, point lookups by video_id with no joins are all you need. That is exactly what Cassandra is designed for.
5. The transcoding pipeline
One source file spawns 7-9 jobs, one per (quality, codec) combination.
| Job | Codec | Resolution | Bitrate |
|---|---|---|---|
| 1 | H.264 | 426×240 | 400 kbps |
| 2 | H.264 | 640×360 | 700 kbps |
| 3 | H.264 | 854×480 | 1,200 kbps |
| 4 | H.264 | 1280×720 | 2,500 kbps |
| 5 | H.264 | 1920×1080 | 5,000 kbps |
| 6 | H.265 | 1280×720 | 1,500 kbps |
| 7 | H.265 | 1920×1080 | 3,000 kbps |
| 8 | AV1 (top 1% only) | 1280×720 | 1,000 kbps |
| 9 | AV1 (top 1% only) | 1920×1080 | 2,000 kbps |
Show: one worker's loop
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
while True:
job = kafka.consume("transcode.requested", timeout=0.5)
if job is None:
continue
db.update_variant(job.video_id, job.quality, status="encoding")
source = s3.download(f"s3://video-source/{job.video_id}/source.mp4")
run_ffmpeg(
input=source,
codec=job.codec,
bitrate=job.bitrate,
resolution=job.resolution,
segment_duration=6,
force_key_frames="expr:gte(t,n_forced*6)", # align cuts across qualities
output_format="m4s",
output_prefix=f"s3://transcoded/{job.video_id}/{job.quality}/",
)
s3.upload(f"{job.output_prefix}/playlist.m3u8", build_variant_playlist(job))
db.update_variant(job.video_id, job.quality, status="ready", segment_count=...)
kafka.publish("transcode.completed", {"video_id": job.video_id, "quality": job.quality})
Four details that matter for production:
Segment alignment across qualities. Every quality cuts at exactly the same timestamps (0s, 6s, 12s, 18s…). The player switches qualities between segments. If 360p cuts at different boundaries than 1080p, the switch causes a visible frame jump. Use -force_key_frames to enforce this.
Idempotent output. Re-running the same job overwrites the same S3 keys. This is intentional. Kafka may redeliver a message after a worker crash. Idempotent output makes at-least-once delivery safe.
Janitor for stuck jobs. A periodic scan finds variants rows with status=encoding older than 2x the expected job duration and republishes them to Kafka. Workers are stateless; restarting is always safe.
Manifest assembly. A separate Manifest Builder consumes transcode.completed events. It writes a partial master.m3u8 after the first quality completes, then rewrites it as more variants finish. The master playlist has a 60-second CDN TTL so viewers see new qualities appear quickly as they complete.
6. The engine: adaptive bitrate
The player, not the server, decides which quality to request next. This is ABR (adaptive bitrate streaming).
stateDiagram-v2
direction LR
[*] --> Starting: press Play
Starting --> Buffering: fetch master.m3u8, pick 360p
Buffering --> Playing: first segment decoded
Playing --> Playing: fetch next segment, measure speed
Playing --> SwitchUp: buffer > 30s AND bandwidth allows higher quality
Playing --> SwitchDown: buffer < 5s, need to refill fast
SwitchUp --> Playing: next request at higher quality
SwitchDown --> Playing: next request at lower quality
Playing --> [*]: video ends
The player keeps about 30 seconds of video in its buffer. It measures bandwidth on each segment download. If a 1 MB segment downloads in 200ms, that is 5 MB/s = 40 Mbps. Plenty for 1080p at 5 Mbps. If the buffer drops below 5 seconds, it drops to a lower quality to refill fast.
HLS and DASH both implement ABR with different file formats:
- HLS (Apple):
.m3u8playlists,.m4ssegments. Default on iOS and Safari. - DASH (ISO standard):
.mpdXML manifests,.m4ssegments. Default on Android.
Modern platforms write segments in .m4s (the CMAF format) and produce both an HLS playlist and a DASH manifest pointing at the same byte files. One set of segments, two manifests, all devices covered.
7. The full architecture
flowchart TB
subgraph Edge["Client edge"]
C(["Creator / Viewer"]):::user
GW["API Gateway<br/>(auth · rate limit · routing)"]:::edge
end
subgraph UploadPipeline["Upload pipeline"]
U["Upload Service<br/>(TUS / S3 multipart)"]:::app
S3src[("S3 source<br/>hot→IA at 30d→Glacier at 90d")]:::db
K{{"Kafka<br/>transcode.requested<br/>transcode.completed<br/>video.published"}}:::queue
TW["Transcoding Farm<br/>H.264 CPU · H.265 NVENC · AV1 CPU"]:::app
MB["Manifest Builder"]:::app
DB[("Postgres<br/>videos · variants · manifests")]:::db
end
subgraph PlaybackPath["Playback path"]
WA["Watch Page API<br/>(metadata + sign URLs)"]:::app
RD[("Redis<br/>hot metadata cache")]:::cache
end
S3out[("S3 segments<br/>hot/warm/cold tiered")]:::db
subgraph CDNStack["CDN (3-tier)"]
EP["Edge PoPs<br/>(200+ worldwide, ~95% hit)"]:::edge
SH["Regional Shields<br/>(10-20 worldwide, ~80% hit on misses)"]:::edge
end
subgraph TelemetryStack["Telemetry pipeline"]
TK{{"Kafka<br/>playback events"}}:::queue
FL["Flink<br/>(aggregate, 1-min windows)"]:::app
VDB[("Redis + ClickHouse<br/>view counts · analytics")]:::db
end
C --> GW
GW -->|upload| U
GW -->|watch| WA
GW -->|telemetry| TK
U --> S3src
U --> DB
U --> K
K --> TW
TW --> S3out
TW --> K
K --> MB
MB --> S3out
MB --> DB
WA --> DB
WA --> RD
WA -->|"signed URL"| EP
EP -.miss.-> SH
SH -.miss.-> S3out
TK --> FL
FL --> VDB
classDef user fill:#dbeafe,stroke:#1e40af,color:#1e3a8a
classDef edge fill:#e2e8f0,stroke:#475569,color:#1e293b
classDef app fill:#dcfce7,stroke:#15803d,color:#14532d
classDef db fill:#fed7aa,stroke:#c2410c,color:#7c2d12
classDef cache fill:#fecaca,stroke:#b91c1c,color:#7f1d1d
classDef queue fill:#ddd6fe,stroke:#6d28d9,color:#4c1d95
Five things worth noticing:
- The Watch Page API is the only piece of your code on the playback hot path. Target P99 latency: 50ms.
- Kafka sits between every pipeline stage. A worker crash at 3 a.m. does not lose jobs.
- Transcoding workers are stateless. Restart any worker at any time. The job reruns from scratch (idempotent output makes this safe).
- Telemetry is fully async. If Flink has a lag spike, watch pages keep loading. View counts just update a bit late.
- Redis caches the most popular video metadata. For a video with 10M views/day, Postgres would be overwhelmed without it.
8. A playback request, end to end
sequenceDiagram
autonumber
participant Bob
participant GW as API Gateway
participant WA as Watch Page API
participant RD as Redis
participant DB as Postgres
participant CDN as CDN Edge PoP
participant S3out as S3 (origin)
Bob->>GW: GET /videos/v_8h3jK2p
GW->>WA: forward (auth ok, ~5ms)
WA->>RD: GET video:v_8h3jK2p
alt Redis hit (~90% for popular videos)
RD-->>WA: metadata (~1ms)
else cache miss
WA->>DB: SELECT video, manifests (~10ms)
DB-->>WA: rows
WA->>RD: SET video:v_8h3jK2p TTL=5min
end
WA-->>Bob: title, signed master.m3u8 URL (~50ms total)
Bob->>CDN: GET master.m3u8 (signed URL)
alt edge cache hit (~95%)
CDN-->>Bob: 1 KB playlist (~50ms)
else miss → shield → S3
CDN->>S3out: GET master.m3u8
S3out-->>CDN: file
CDN-->>Bob: 1 KB playlist
end
Bob->>CDN: GET 360p/seg_0.m4s
CDN-->>Bob: ~1 MB segment (6 sec, ~200ms)
Note over Bob: decodes, starts playing, measures 40 Mbps available
Bob->>CDN: GET 720p/seg_1.m4s
CDN-->>Bob: ~1.9 MB segment (~250ms)
Note over Bob,CDN: player continues fetching segments, adjusting quality as bandwidth changes
Latency budget for first frame:
| Step | Typical time | Notes |
|---|---|---|
| API call (Redis hit) | ~50ms | |
GET master.m3u8 (CDN edge hit) | ~50ms | Edge PoP in the same city |
GET playlist.m3u8 (CDN edge hit) | ~50ms | |
GET seg_0.m4s (CDN edge hit) | ~200ms | Depends on segment size and local bandwidth |
| First frame decode | ~50ms | |
| Total | ~400ms | Cache hits assumed |
A 2-second start-time SLO is achievable with cold-cache misses included.
9. The scaling journey: 100 users to 1 million
flowchart LR
S1["Stage 1<br/>100 creators<br/>Postgres + 1 worker<br/>~$100/mo"]:::s1
S2["Stage 2<br/>~1K creators<br/>+ Kafka + worker pool<br/>~$500/mo"]:::s2
S3["Stage 3<br/>~100K users<br/>+ CDN tiers + Redis<br/>+ storage tiering<br/>~$5K/mo"]:::s3
S4["Stage 4<br/>~1M+ users<br/>+ Cassandra + multi-region<br/>+ AV1 + ClickHouse"]:::s4
S1 --> S2 --> S3 --> S4
classDef s1 fill:#e0f2fe,stroke:#0369a1,color:#0c4a6e
classDef s2 fill:#dcfce7,stroke:#15803d,color:#14532d
classDef s3 fill:#fef3c7,stroke:#a16207,color:#713f12
classDef s4 fill:#fce7f3,stroke:#be185d,color:#831843
Stage 1: 100 creators
One Postgres, one transcoding worker, direct-to-S3 playback behind a basic CDN. Uploads are single-file HTTP posts. No queue. If the worker dies, restart it and resubmit. ~$100/month.
This is enough because you see maybe 50 uploads per day. Over-engineering at this stage wastes weeks.
Stage 2: ~1,000 creators
Something breaks: a popular creator uploads a 2 GB file, their Wi-Fi drops, they have to start over. Another problem: upload spikes during peak hours overwhelm the single worker.
Fix both at once: chunked resumable uploads (TUS or S3 multipart) and Kafka with a pool of workers. Workers autoscale on Kafka consumer lag. ~$500/month.
Stage 3: ~100,000 users
Several things break at once:
- Popular videos cause hundreds of requests per second to S3. S3 throttles.
- Creators complain the watch page is slow.
- Storage costs grow faster than revenue.
Fixes: three-tier CDN (edge + shield + origin), Redis for hot video metadata, storage tiering via S3 lifecycle rules. Cost jumps to ~$5K/month but CDN and tiering save 5-10x more than they cost.
Stage 4: ~1M+ users
New problems:
- Postgres starts struggling at ~100M video rows.
- A single region means EU users see high latency.
- AV1 starts making economic sense for the most-watched 1%.
Migrate to Cassandra sharded by video_id. Add a second region with cross-region S3 replication for the top 5% of hot content. Start the AV1 pipeline on a dedicated CPU pool. Add ClickHouse for creator analytics. The core pipeline shape has not changed since Stage 2. You are adding capacity and correctness around it.
10. The CDN in detail
| Tier | Location | Caches | TTL | Hit rate target |
|---|---|---|---|---|
| Edge PoPs | One per major city (200+ worldwide) | Hot segments and manifests for that city | 7 days for segments, 60s for master | 95%+ |
| Regional shields | One per cloud region (10-20 worldwide) | Everything any edge in this region fetched | 7 days | 80%+ on edge misses |
| Origin | S3 + signing service | Everything | Authoritative | N/A |
Why the shield changes the cost model: without it, 200 edge PoPs in a region missing the same segment each fire a separate S3 GET. That is 200 requests for one file. With the shield, the first miss does one S3 GET and serves the other 199 from its cache. S3 request volume scales with the number of distinct videos being watched, not the number of PoPs.
At 250 Tbps total egress with a 95% edge hit rate: S3 origin sees roughly 1% of total traffic. Without the CDN, S3 would need to serve all 250 Tbps. The CDN is not an optimization. It is what makes the numbers work at all.
11. Storage tiering
From the math in section 2:
- 70 PB new per year. If all stored at S3 Standard: $0.023/GB × 70,000 TB × 12 months = roughly $20M/year for one year of content. After 10 years: hundreds of millions.
- With tiering (5% hot / 15% warm / 80% cold): blended ~$0.005/GB. Same data costs roughly $4M/year. About 5x cheaper.
The tiering job runs nightly, reads last_viewed_at from ClickHouse, and emits S3 lifecycle transitions in bulk. When a cold video goes viral, it is promoted back to Standard within minutes.
Source files are never deleted. They move from Standard to Glacier after 30 days. When a new codec ships, you pull sources from Glacier and re-encode without quality loss.
12. Reliability
Worker crash mid-transcoding. Kafka redelivers the message. The worker re-runs and overwrites the same S3 keys. Output is idempotent; no partial segments survive.
Kafka broker loss. Replication factor 3. Leader election handles a single broker failure. Producers with idempotency enabled prevent duplicates on retry.
CDN PoP outage. The CDN’s own routing shifts traffic to neighboring PoPs. Latency rises for viewers in that city. Nothing else breaks.
Regional shield outage. Edge PoPs fall back to direct-to-S3. S3 request volume spikes 10-50x for cold-tail content. Edges serve cached segments via stale-while-revalidate; most viewers see no interruption. Long-tail videos may be unplayable until the shield recovers.
S3 outage. Switch signed URLs to point at a cross-region replica. Replication lag means videos uploaded in the last few minutes may 404. Accept this as a narrow window.
Metadata DB partial failure. At Cassandra scale, replication factor 3 per region survives node failures. A full-region failure means writes from that region stall; reads continue from other regions. At Postgres scale, a read replica handles the read side during primary recovery.
Janitor for stuck jobs. A periodic scan finds variants rows stuck in encoding longer than 2x the expected job duration and republishes them. Expected durations are estimated from source file size and codec.
13. Observability
| Metric | Why it matters |
|---|---|
upload.ingest_bytes_per_sec | Spike means attack; drop means auth is broken |
upload.chunk_fail_rate | High means upload service or S3 is degraded |
kafka.transcode.requested.lag | Autoscaler input; high means farm is undersized |
transcoder.job_duration by codec | Regression in ffmpeg or source quality |
transcoder.failure_rate | Bad sources, OOMs, or codec bugs |
manifest.assembly_lag | Time from last variant done to master published; slow hurts creator experience |
cdn.edge_hit_rate per PoP | The headline cost driver. Below 90%, investigate. |
cdn.shield_hit_rate | Below 70% means shield is undersized or TTLs are wrong |
cdn.s3_origin_request_rate | Should be ~1% of total. Spike means edge and shield both failed |
playback.first_frame_p99_ms | The viewer SLO. Page at >3,000ms for 5 minutes |
playback.rebuffer_ratio | Rebuffer seconds / playback seconds. Target <0.5% |
playback.quality_distribution | Viewers stuck on 240p = CDN issue or ABR bug |
view_counter.lag_seconds | If a video goes viral and shows 0 views, this is broken |
Page on: cdn.edge_hit_rate < 85% for 10 minutes. playback.first_frame_p99 > 3,000ms for 5 minutes. transcoder.failure_rate > 5%. s3_origin_request_rate > 10x baseline.
14. Follow-up answers
1. Delete a viral video from every cache.
Update videos.status=blocked in Postgres immediately. The Watch Page API now refuses to issue signed URLs. New viewers get “video unavailable.” Issue a CDN cache purge by URL pattern (/v_8h3jK2p/*). Major CDNs (Cloudflare, Fastly, Akamai, CloudFront) propagate purges in 5-30 seconds globally. For the most urgent takedowns (CSAM, court order), revoke the signing key for that video. Signed URLs in-flight stop working. Viewers mid-playback drain their buffer, then see an error. The 60-second target is reachable.
2. AV1 backfill for the top 10,000 videos.
Sort videos by total_watch_seconds descending (more meaningful than raw view count). Take the top 10K. Estimate savings per video: (h264_bitrate - av1_bitrate) × monthly_watch_seconds. Filter to videos where bandwidth savings exceed 10x the one-time encoding cost. Average video is 10 minutes. AV1 at 0.5x real-time = 20 minutes of compute per video. 10K videos × 20 min / 1,000 parallel cores is about 3 hours wall-clock. Publish 10K transcode.requested messages with priority=backfill (lower priority than live uploads). Manifest Builder adds the AV1 variant to the existing master playlist when done. Players that support AV1 pick it automatically; older players stick with H.264.
3. Live streaming.
Live and VOD share almost no infrastructure. Ingest changes from HTTP upload to RTMP or WebRTC pushed to a regional ingest server. Transcoding becomes real-time: each incoming GOP is encoded on the fly into the quality ladder, not post-hoc on a source file. The segment protocol changes to LL-HLS or LL-DASH with 1-2 second segments and HTTP chunked transfer, so the player can start downloading a segment while it is still being produced. CDN TTLs on live segments are seconds, not days. For 5 million concurrent viewers, LL-HLS at 3-5 seconds glass-to-glass is the practical answer. Sub-second latency via WebRTC does not scale beyond a few thousand concurrent viewers without a complex mesh.
When the stream ends, the live segments become the VOD asset with no re-transcoding needed.
4. Thumbnails at scale.
During transcoding, a second ffmpeg pass extracts frames: one at 10%, 50%, and 90% of duration for main thumbnails, and one every 2 seconds for the seek bar. A 4-minute video produces about 120 seek frames × 30 KB = 3.6 MB. At 450,000 videos/day, that is roughly 1.6 TB/day of thumbnail data. Pack seek frames into a sprite sheet (one image, CSS coordinates in a JSON descriptor). One CDN request instead of 120. Store in S3 alongside segments. Long TTL (1 year). Serve through the same CDN. Custom thumbnails uploaded by creators go through the Upload Service as a small file (no chunking needed) and overwrite the auto-generated main thumbnail.
5. Copyright takedown.
Trust-and-safety calls POST /admin/videos/<id>/block. This updates videos.status=blocked, revokes the signing key for that video_id, and triggers a CDN purge for /v_<id>/*. Do not delete the S3 objects. Move them to a restricted S3 bucket with a separate IAM policy that only the legal team and a logged-access process can read. Every block is an immutable audit record: who blocked, when, which notice, expected review date. If the takedown is withdrawn, restore the signing key and move files back. The 5-minute target is reachable with CDN purge propagation plus signing-key revocation.
6. Watch-time analytics.
Players emit heartbeats every few seconds with (video_id, session_id, position_seconds). Flink aggregates: for each video and each 10-second position bucket, count distinct sessions that reached it. The result is a retention curve: at position 0:00, 100% of sessions; at 3:47, maybe 60%; at the end, maybe 30%. Store in ClickHouse, partitioned by video_id, bucketed by position. Query latency is in the hundreds of milliseconds even on large data. Creators see the curve in their Studio dashboard. Updating nightly is fine; real-time would not change creator decisions.
7. DRM (Netflix mode).
Every transcoded segment is encrypted with AES-128 in CTR mode using a per-content encryption key (CEK). The manifest references a license server URL. At playback start: (1) player fetches the manifest, (2) player requests a license, (3) license server verifies the user is subscribed and geo-allowed, then returns the CEK wrapped for this device with a TTL, (4) player decrypts segments on the fly. Add to the architecture: a License Server (stateless API), a Key Management Service (stores CEKs indexed by video_id), and integration with Widevine (Android/Chrome), FairPlay (Apple), and PlayReady (Windows). Per-session license issuance adds 50-200ms to start time. Manifests can be cached. Licenses cannot.
8. Multiple audio tracks and subtitles.
HLS master playlists have #EXT-X-MEDIA entries for each audio and subtitle track. Audio is transcoded as separate variant streams (one per language). Subtitles are WebVTT files served as their own playlists. The player downloads video + one audio track at a time. Mixing at encode time is not needed: 7 video variants × 3 audio = 10 total files, not 21. Subtitles are tiny and all 12 can be listed in the manifest without significant overhead.
9. Regional shield outage.
Every edge PoP behind the downed shield falls back to S3 for cache misses. S3 request volume spikes 10-50x. Edges have stale-while-revalidate set: they serve cached segments to in-progress viewers while attempting a background refresh. Most viewers see no interruption. Long-tail videos (not yet cached at the edge) may be unplayable until the shield recovers. Pre-provision higher S3 request rate limits when you deploy a shield so the spike does not exhaust your S3 concurrency quota.
10. Real-time view counts for ranking.
The current pipeline (player → Kafka → Flink 1-minute windows → ClickHouse) has 5-30 minutes of lag. For 5-second freshness, add a parallel high-frequency aggregator: the same Flink job with 5-second windows writing to Redis with INCR views:<video_id>. The dedup is best-effort (not exactly-once across Kafka rebalances). For a ranking signal, approximate is fine. The watch page reads from the slower exactly-once pipeline for displayed view counts; the recommendation team reads from the fast approximate store. Two pipelines, two guarantees.
15. Trade-offs worth saying out loud
Build vs buy the CDN. At Google’s scale, they built their own global edge network. Below about 10 Tbps sustained egress, buying Cloudflare or Fastly is cheaper because you avoid PoP capital costs and operations. Above that, owning the CDN starts to make sense. Name the break-even, not just “buy” or “build.”
Codec choice is a budget decision, not a technical one.
- H.264: encodes fast, decodes everywhere, file size baseline. Always ship this.
- H.265: 30% smaller files, 3x slower to encode, patent licensing fees are real, decoder support is near-universal. Worth the tradeoff for most platforms above early stage.
- AV1: 30% smaller than H.265, royalty-free, 15-30x slower to encode. Use only for the top 1-5% of videos by watch time, where bandwidth savings recoup the encode cost within days.
Storage tiering tuning. Demoting a video that gets re-promoted 30% of the time within 30 days means you demoted too early. Tune the threshold per content category: news videos age quickly (aggressive demote), cooking tutorials less so. Measure demote-then-repromote rates and adjust.
Single manifest vs progressive manifest. The Manifest Builder can publish a partial master after the first quality completes (360p only), then rewrite as 720p and 1080p finish. This cuts time from “upload finished” to “watchable” from 20 minutes to 5 minutes. The cost is that master.m3u8 needs a short CDN TTL (60 seconds) during transcoding, which means more origin requests for a video’s first hour. Worth it for creator experience.
What to revisit at 10x scale. Custom hardware encoding ASICs (similar to Google Argos). Per-shot encoding: analyze each scene and encode at the optimal bitrate for its complexity rather than a fixed ladder. Saves 20%+ bandwidth. Netflix does this. Move manifest building to CDN-edge computation, generating manifests per viewer device in real time rather than pre-building static files.
16. Common mistakes
Designing upload and playback as one system. They are two separate systems sharing a database. A single diagram for both gets confusing within ten minutes. Separate them from the first sentence.
Skipping CDN tiers. “Use a CDN” is not a design. The regional shield is the critical piece. Without it, S3 cannot absorb the request volume when edge caches miss. Mention all three tiers.
No segment alignment. Players switch qualities between segments. If 360p and 1080p cut at different timestamps, the switch glitches visually. Most candidates skip this. Interviewers at media companies always ask.
Hand-waving the transcoding compute. “We’ll have transcoders” is not enough. Mention separate node pools per codec, autoscaling on Kafka lag, idempotent output, and the AV1 cost story.
Forgetting resumable upload. A 5 GB single PUT fails on a phone. Chunked resumable upload is required for any platform that takes uploads over 100 MB.
No view-count pipeline. 125 million concurrent viewers each generating a view event. You cannot UPDATE a DB row per view. Kafka + Flink with windowed aggregation is the standard answer.
Ignoring storage tiering. 70 PB/year at S3 Standard prices is financially unsustainable. The 4-5x cost reduction from tiering is material.
Treating live as a minor variation on VOD. Live needs real-time transcoding, RTMP/WebRTC ingest, LL-HLS delivery, and a 3-5 second latency budget. It is a different system. Acknowledge it and scope it separately.
Overengineering thumbnails. They are small files behind a CDN. A sprite sheet handles the seek bar. Do not spend five minutes here when the interviewer wants to hear about CDN tiering or transcoding.
Missing DRM when the question is Netflix. Even if not asked, one sentence naming the license server and per-device key flow shows you know the difference between YouTube and Netflix architectures.