Backup Fundamentals — RPO and RTO
Two numbers answer most of the design questions in backup and disaster recovery: how much data can you afford to lose (RPO) and how long can you be down (RTO). Everything else follows from those.
The two numbers
RPO — Recovery Point Objective
“If I lose everything right now, how far back is the last recoverable state?”
RPO is measured in time. It’s the worst-case gap between the moment of failure and the last restorable copy. RPO = 1 hour means you can tolerate losing up to 1 hour of data.
- RPO is driven by how often you back up. Backups every 24 h → RPO ≤ 24 h.
- RPO = 0 requires synchronous replication — every write is committed to both sides before it’s acknowledged.
- Low RPO costs money: more storage, more bandwidth, more complex replication.
RTO — Recovery Time Objective
“If we have to recover, how long until the service is running again?”
RTO is measured in time. The clock starts at the outage and stops when the service is usable again. RTO = 4 hours means you have 4 hours from “it broke” to “it works.”
- RTO is driven by how fast you can restore. Pulling tapes from offsite → measured in days. Hot standby → measured in seconds.
- Low RTO costs money: warm or hot standby systems, automated failover, more skilled operators on call.
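The two definitions above reduce to simple timestamp arithmetic. A minimal sketch (function and variable names are illustrative, not from any particular tool) for computing the RPO and RTO actually achieved in an incident:

```python
from datetime import datetime, timedelta

def achieved_rpo(failure_time: datetime, last_backup_time: datetime) -> timedelta:
    """Worst-case data loss: time between the last restorable copy and the failure."""
    return failure_time - last_backup_time

def achieved_rto(failure_time: datetime, service_restored_time: datetime) -> timedelta:
    """Downtime: time from the outage to the service being usable again."""
    return service_restored_time - failure_time

# Hypothetical incident: nightly backup at 02:00, failure at 14:30, restored at 17:00.
failure = datetime(2024, 3, 1, 14, 30)
rpo = achieved_rpo(failure, datetime(2024, 3, 1, 2, 0))   # 12h30m of data lost
rto = achieved_rto(failure, datetime(2024, 3, 1, 17, 0))  # 2h30m of downtime
```

Comparing these measured values against the *objectives* is how you know whether the design actually meets its targets.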
The two are independent
You can have:
- Low RPO + high RTO — “we have near-zero data loss, but it takes us two days to bring it back up” (typical of replicated backups you then need to restore onto new hardware)
- High RPO + low RTO — “we lose a day of data in a disaster but the site is back up in 15 minutes” (hot standby with stale data)
Design for both, explicitly.
Backup types
Full backup
Copy everything, every time.
- Simple, slow, storage-heavy
- Restore is one step — just the latest full backup
Incremental backup
Copy only what has changed since the last backup of any kind.
- Fast, small per run
- Restore requires the last full + every incremental since
- Chain failures (one missing incremental = later ones unusable)
Differential backup
Copy everything that has changed since the last full backup.
- Between full and incremental in size/speed
- Restore requires only the last full + the latest differential — much simpler chain
- Grows over time until the next full
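The restore-chain differences between the three types can be made concrete. A sketch (the function and labels are illustrative; it assumes the runs after the last full are all of one type, as in a typical rotation) that resolves which backups a restore needs:

```python
def restore_chain(history):
    """history: list of ("full"|"incremental"|"differential", label), oldest first.
    Returns the labels needed to restore to the latest state."""
    if not history:
        return []
    last_kind, _ = history[-1]
    # Index of the most recent full backup.
    last_full = max(i for i, (kind, _) in enumerate(history) if kind == "full")
    if last_kind == "full":
        return [history[-1][1]]                       # one step: just the full
    if last_kind == "differential":
        return [history[last_full][1], history[-1][1]]  # last full + latest diff
    # Incremental: last full plus every incremental since.
    return [history[last_full][1]] + [
        label for kind, label in history[last_full + 1:] if kind == "incremental"
    ]

history = [("full", "sun"), ("incremental", "mon"), ("incremental", "tue")]
print(restore_chain(history))  # ['sun', 'mon', 'tue']
```

Note the chain-failure property: every label in the incremental result is required, so losing `mon` makes `tue` unusable, whereas the differential chain never needs more than two pieces.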
Typical rotation
A classic enterprise pattern:
- Full backup weekly (e.g., Sunday)
- Differential daily
- Incremental hourly (optional)
Modern practice (in cloud / snapshot-driven systems) is more often continuous incremental forever with synthetic fulls.
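A "synthetic full" is built by merging the last full with the incrementals since, entirely on the backup side. A toy sketch of that merge, assuming each backup is modelled as a mapping of path to content (real products work against block- or catalog-level data, not dicts):

```python
def synthetic_full(base_full: dict, incrementals: list[dict]) -> dict:
    """Merge a full backup with incremental change sets (oldest first) into a
    new full, without re-reading the production system. In this toy model a
    value of None marks a deletion."""
    merged = dict(base_full)
    for inc in incrementals:
        for path, content in inc.items():
            if content is None:
                merged.pop(path, None)   # file deleted since the last backup
            else:
                merged[path] = content   # file added or changed
    return merged

full = {"a.txt": "v1", "b.txt": "v1"}
incs = [{"a.txt": "v2"}, {"b.txt": None, "c.txt": "v1"}]
print(synthetic_full(full, incs))  # {'a.txt': 'v2', 'c.txt': 'v1'}
```

The payoff: the restore chain collapses back to a single full without the storage and I/O cost of re-copying everything from production.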
The 3-2-1 rule
The single most important rule in backups:
- 3 copies of your data
- 2 different media types
- 1 copy offsite
Why it works:
- 3 copies — the original plus two backups. One backup can silently corrupt; you still have another.
- 2 media types — e.g., disk + tape, or online + cloud. A bug that eats “all your disks” shouldn’t take your tapes with it.
- 1 offsite — fire, flood, theft, ransomware encrypting your entire on-prem network.
Modern variants:
- 3-2-1-1-0 — adds 1 immutable/air-gapped copy and 0 errors on verification.
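The rule is mechanical enough to check in code. A sketch of a compliance check over an inventory of copies (the `Copy` type and media names are invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class Copy:
    media: str          # e.g. "disk", "tape", "cloud-object"
    offsite: bool
    immutable: bool = False

def satisfies_3_2_1(copies: list[Copy]) -> bool:
    """True if the inventory (original included) meets 3-2-1:
    3+ copies, 2+ media types, 1+ copy offsite."""
    return (
        len(copies) >= 3
        and len({c.media for c in copies}) >= 2
        and any(c.offsite for c in copies)
    )

inventory = [
    Copy("disk", offsite=False),                         # production original
    Copy("disk", offsite=False),                         # local backup
    Copy("cloud-object", offsite=True, immutable=True),  # offsite, locked
]
print(satisfies_3_2_1(inventory))  # True
```

Extending the check to 3-2-1-1-0 would mean also requiring `any(c.immutable for c in copies)` plus a clean verification run.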
Snapshots vs backups
These get confused constantly:
| | Snapshot | Backup |
|---|---|---|
| Where | Same storage system | Separate system |
| Mechanism | Copy-on-write metadata | Copy of data to another medium |
| Speed | Instant | Slow (scales with data size) |
| Protects against | Fat-finger delete, recent corruption | Site loss, storage failure, ransomware |
| Survives if storage array dies | No | Yes |
| Counts toward “3-2-1” | No | Yes |
A snapshot is a useful first recovery tier — seconds to restore from, near-zero RPO. But if the underlying storage is lost, all snapshots go with it. You still need real backups.
Replication vs backup
Replication keeps a remote copy of current data — useful for DR failover and low RTO. But it’s not a backup:
- Deletions replicate. Corruption replicates. Ransomware replicates.
- You need backups that are point-in-time and immutable to survive these.
Common pattern: synchronous replication within a metro area (low RPO/RTO for site failure) + traditional backup for point-in-time recovery.
Testing is the whole game
An untested backup is not a backup — it’s a belief. Countless organisations have discovered only mid-crisis that their backups were broken.
- Test restores regularly. Quarterly is a minimum; monthly is better.
- Test at scale. A 1 GB file restore doesn’t prove a 2 TB database will restore within RTO.
- Document the runbook. “It worked when Dave did it” is not a recovery plan.
- DR drills — full failover to the DR site. Painful. Necessary.
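The "0 errors on verification" part of testing can be automated with checksums. A minimal sketch, assuming you record a checksum manifest at backup time and compare it against the restored files (names and the in-memory file model are illustrative):

```python
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def verify_restore(manifest: dict[str, str], restored: dict[str, bytes]) -> list[str]:
    """Compare a restored file set against checksums recorded at backup time.
    Returns the paths that are missing or corrupt; an empty list means the
    restore test passed."""
    failures = []
    for path, expected in manifest.items():
        data = restored.get(path)
        if data is None or checksum(data) != expected:
            failures.append(path)
    return failures

manifest = {"db.dump": checksum(b"payload")}
print(verify_restore(manifest, {"db.dump": b"payload"}))  # [] — verified
print(verify_restore(manifest, {"db.dump": b"corrupt"}))  # ['db.dump']
```

A check like this belongs in the scheduled restore test, not just in the backup job — it proves the copy you would actually restore from is intact.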
Ransomware-era considerations
- Immutable storage — once written, can’t be modified or deleted for a retention period. Object Lock (S3), WORM tapes, vendor immutability features.
- Air gap — physical or logical isolation so the backup system is unreachable from compromised production.
- Credentials separation — backup admin credentials must not be harvestable from production systems.
- Test with bad assumptions — assume production is fully compromised; can you still restore?
The “tiering” cheat sheet
| Tier | RPO | RTO | Mechanism |
|---|---|---|---|
| 0 — synchronous replication | ~0 | seconds | Stretched cluster, sync replication |
| 1 — async replication + hot standby | seconds to minutes | minutes | Replicated array, pilot-light VMs |
| 2 — warm standby + snapshots | minutes to hours | hours | Snapshot replication, spin-up on demand |
| 3 — backups | hours to days | hours to days | Traditional backup restore |
| 4 — cold / offsite / tape | days | days to weeks | Tape archives, offsite retrieval |
Most applications get a mix — the critical subset at tier 0 or 1, everything else at tier 2 or 3.
See also
- RAID Levels — RAID protects against drive failure, backups protect everything else
- SAN vs NAS
- High Availability
- 🖥️ Server Infrastructure MOC