Disaster recovery: RTO vs RPO
What promises you can keep and what you can't.
Disaster recovery is what you do after the disaster: a region outage, a deleted database, a ransomware attack, a fire in a data centre. It is not the same as everyday high availability; HA is for “the box died.” DR is for “the data centre is gone.” Every DR plan answers two numbers: RTO (how long until we are running again?) and RPO (how much data are we willing to lose?). Cheaper plans give you worse numbers. Expensive plans give you better ones. Pretending you do not need a plan gives you the worst numbers of all on the day you find out.
The two numbers
flowchart LR
T0(["Disaster strikes<br/>t = 0"]):::dead
RPO[("RPO<br/>= how much data can you afford to lose?<br/>e.g. 'no more than 5 minutes' = 5-minute RPO")]:::infra
RTO[("RTO<br/>= how long can you be down?<br/>e.g. 'back up in 1 hour' = 1-hour RTO")]:::infra
T0 -.->|"data from before t = 0<br/>but after the last replicated point<br/>is lost — measured by RPO"| RPO
T0 -.->|"time until you are<br/>serving traffic again<br/>measured by RTO"| RTO
classDef dead fill:#fecaca,stroke:#b91c1c,color:#7f1d1d,stroke-width:1.5px
classDef infra fill:#fef3c7,stroke:#a16207,color:#713f12,stroke-width:1.5px
- RPO (Recovery Point Objective): the maximum amount of data you are willing to lose. Measured in time. RPO = 1 hour means “we can tolerate losing the last hour of data.”
- RTO (Recovery Time Objective): the maximum amount of time you can be down. Measured in time. RTO = 30 minutes means “we must be serving traffic again within 30 minutes.”
Both numbers are business decisions, not engineering decisions. They translate directly into how much you have to spend on DR infrastructure.
The cost / time trade-off
flowchart TB
subgraph TIER1["Cold standby — cheap, slow"]
direction LR
C1[("backups only,<br/>no standby infra")]:::store
C2[("RTO: hours to days<br/>RPO: hours")]:::weak
end
subgraph TIER2["Pilot light — modest, moderate"]
direction LR
P1[("minimal infra running in DR region,<br/>data replicated continuously,<br/>scale up on activation")]:::store
P2[("RTO: ~30 min to 2 hours<br/>RPO: seconds to minutes")]:::mid
end
subgraph TIER3["Warm standby — significant, fast"]
direction LR
W1[("full but smaller copy running,<br/>data continuously replicated")]:::store
W2[("RTO: minutes<br/>RPO: seconds")]:::strong
end
subgraph TIER4["Hot standby / active-active — expensive, near-zero"]
direction LR
H1[("full identical infra running,<br/>traffic served from both,<br/>sync or near-sync replication")]:::store
H2[("RTO: near zero<br/>RPO: near zero")]:::strong
end
classDef store fill:#e9d5ff,stroke:#7e22ce,color:#581c87,stroke-width:1.5px
classDef weak fill:#fed7aa,stroke:#c2410c,color:#7c2d12,stroke-width:1.5px
classDef mid fill:#fef3c7,stroke:#a16207,color:#713f12,stroke-width:1.5px
classDef strong fill:#dcfce7,stroke:#15803d,color:#14532d,stroke-width:1.5px
Better RTO and RPO cost more infrastructure: closer to live, closer to running, closer to instant. Cheaper plans accept some downtime and some data loss to keep the spend reasonable.
What each tier looks like in practice
Cold standby (backup-only)
You restore from backup into a freshly-provisioned environment in another region. Hours of work, possibly days for very large data.
- RTO: 4 to 24 hours
- RPO: as fresh as your last backup (often 1-24 hours)
- Cost: cheapest. You pay only for backup storage.
Fine for internal tools, side projects, and workloads where a half-day outage is annoying but not catastrophic. Not fine for customer-facing revenue paths.
Pilot light
A skeleton of the production environment runs continuously in the DR region. Databases are replicated; application servers exist but in small numbers (or are turned off). On disaster, scale the application tier up, switch DNS or load balancer, traffic flows.
- RTO: 30 minutes to 2 hours
- RPO: seconds to minutes (continuous replication)
- Cost: modest. You pay for replication and a small footprint.
The right starting point for most SaaS companies serious about DR.
Warm standby
A full but possibly smaller copy of production runs in the DR region. Replication is continuous. On disaster, you flip traffic; the DR site is already capable of serving everything but at reduced capacity. Scale up while serving.
- RTO: minutes
- RPO: seconds
- Cost: substantial. Roughly half to full second copy of the production fleet.
This is what most large SaaS companies eventually move to.
Hot standby / active-active
Both regions are live and serving traffic. Replication is synchronous or near-synchronous. A regional failure removes one region from the rotation; the other absorbs the full load. See Multi-region.
- RTO: near zero
- RPO: near zero
- Cost: full duplicate infrastructure, plus the complexity tax of running active-active.
Reserved for workloads where seconds of downtime are unacceptable: payments, healthcare, large-scale platforms.
The picker
flowchart TB
Q1{"How much downtime<br/>can the business survive?"}:::query
Q2{"How much data loss<br/>can the business survive?"}:::query
A1["Hot standby / active-active.<br/>Near-zero RTO, near-zero RPO.<br/>Most expensive."]:::strong
A2["Warm standby.<br/>Minutes of RTO, seconds of RPO.<br/>Significant but bounded cost."]:::strong
A3["Pilot light.<br/>Hour-ish RTO, minutes of RPO.<br/>Good default for SaaS."]:::mid
A4["Cold standby.<br/>Hours of RTO, hours of RPO.<br/>Cheapest. Fine for some workloads."]:::weak
Q1 -->|"seconds"| A1
Q1 -->|"minutes"| A2
Q1 -->|"hours"| A3
Q1 -->|"days"| A4
classDef query fill:#dbeafe,stroke:#1e40af,color:#1e3a8a,stroke-width:1.5px
classDef strong fill:#dcfce7,stroke:#15803d,color:#14532d,stroke-width:1.5px
classDef mid fill:#fef3c7,stroke:#a16207,color:#713f12,stroke-width:1.5px
classDef weak fill:#fed7aa,stroke:#c2410c,color:#7c2d12,stroke-width:1.5px
The honest conversation is “for each system in this business, how long can it be down, and how much data can be lost?” Different systems get different answers. Payments often need hot standby. Internal HR tools can run on cold standby.
What a DR plan actually contains
- The two numbers. RTO and RPO, agreed with the business, written down.
- Trigger criteria. What event causes you to declare a disaster and start the runbook.
- The runbook itself. Step-by-step: who runs it, what commands, what order, what to verify at each step.
- Communication plan. Who tells customers, on what channels, when.
- Rehearsal schedule. Failed-over to DR at least once a quarter, preferably more.
A DR plan you have never executed is a hope. Rehearsals find the broken scripts, the expired credentials, the undocumented dependencies, the procedures that worked in development but do not work at 4 AM in a real outage.
Two scenarios
Scenario one: a SaaS at series-A.
Pilot light DR in another region. Backups archived hourly with WAL streaming for PITR. Quarterly DR drills: provision the application tier in the DR region, point at the replicated database, send 5% synthetic traffic, validate. Cost: a few percent of normal infrastructure spend. RTO: 60 to 90 minutes. RPO: under 5 minutes.
Scenario two: a payment processor.
Active-active across two regions. Synchronous replication between primaries (paying the cross-region latency tax per write). Regional traffic distributed via DNS and BGP. A region going dark removes it from the pool; the other region absorbs all traffic. RTO: seconds. RPO: zero acknowledged transactions lost. Cost: roughly 2x single-region. Justified because seconds of downtime cost more than the duplicate infrastructure does in a year.
What this connects to
- Replication vs backup. Both are inputs to DR; you need them combined. See Replication vs backup.
- Multi-region. Hot standby and active-active are multi-region patterns. See Multi-region.
- Storage tiers. Backups live in tiered cold storage; retention drives RPO and cost. See Hot, warm, cold storage tiers.
- CAP theorem. Synchronous replication across regions trades latency for RPO=0. See CAP theorem.
- Read replicas. Often double as DR copies. See Read replicas.
Common mistakes
- No DR plan at all. “We have backups” is not a DR plan.
- RTO and RPO never written down. Without the numbers, the conversation about cost never happens.
- A DR plan that has never been rehearsed. It almost certainly does not work. The first real run is the rehearsal.
- DR in the same region. “Multi-AZ” is not multi-region. A regional event takes both down.
- Forgetting the people part. Who runs the runbook at 3 AM? Is on-call paged? Is the runbook accessible without the production environment that just went dark?
- DNS failover slower than promised. TTLs, cached resolvers, client-side DNS caching. A 30-second failover can take ten minutes for many users.
- Backups that satisfy RPO but ignore restore time. A 24-hour restore destroys your RTO no matter how recent the backup.
Quick recap
- DR is for “the data centre is gone.” Two numbers define it: RTO (time to recover) and RPO (data loss tolerated).
- Cold standby is cheap and slow; active-active is expensive and near-zero. Pick by business tolerance.
- A DR plan has: numbers, triggers, runbook, communication, rehearsal.
- A plan never executed is a plan that does not work. Drill quarterly.
- The DR conversation is a business conversation in engineering terms; do not have it alone.
This concept sits in Stage 4 (Scaling and reliability) of the System Design Roadmap.