Backup Fundamentals — RPO and RTO
Two numbers answer most of the design questions in backup and disaster recovery: how much data can you afford to lose (RPO) and how long can you be down (RTO). Everything else follows from those.
The two numbers
RPO — Recovery Point Objective
“If I lose everything right now, how far back is the last recoverable state?”
RPO is measured in time. It’s the worst-case gap between the moment of failure and the last restorable copy. RPO = 1 hour means you can tolerate losing up to 1 hour of data.
- RPO is driven by how often you back up. Backups every 24 h → RPO ≤ 24 h.
- RPO = 0 requires synchronous replication — every write is committed to both sides before it’s acknowledged.
- Low RPO costs money: more storage, more bandwidth, more complex replication.
RTO — Recovery Time Objective
“If we have to recover, how long until the service is running again?”
RTO is measured in time. The clock starts at the outage and stops when the service is usable again. RTO = 4 hours means you have 4 hours from “it broke” to “it works.”
- RTO is driven by how fast you can restore. Pulling tapes from offsite → measured in days. Hot standby → measured in seconds.
- Low RTO costs money: warm or hot standby systems, automated failover, more skilled operators on call.
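The two definitions above reduce to simple timestamp arithmetic. A minimal sketch (function and variable names are illustrative, not from any particular tool) for computing the RPO and RTO actually achieved in an incident:

```python
from datetime import datetime, timedelta

def achieved_rpo(failure_time: datetime, last_backup_time: datetime) -> timedelta:
    """Worst-case data loss: time between the last restorable copy and the failure."""
    return failure_time - last_backup_time

def achieved_rto(failure_time: datetime, service_restored_time: datetime) -> timedelta:
    """Downtime: time from the outage to the service being usable again."""
    return service_restored_time - failure_time

# Hypothetical incident: nightly backup at 02:00, failure at 14:30, restored at 17:00.
failure = datetime(2024, 3, 1, 14, 30)
rpo = achieved_rpo(failure, datetime(2024, 3, 1, 2, 0))   # 12h30m of data lost
rto = achieved_rto(failure, datetime(2024, 3, 1, 17, 0))  # 2h30m of downtime
```

Comparing these measured values against the *objectives* is how you know whether the design actually meets its targets.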
The two are independent
You can have:
- Low RPO + high RTO — “we have near-zero data loss, but it takes us two days to bring it back up” (typical of replicated backups you then need to restore onto new hardware)
- High RPO + low RTO — “we lose a day of data in a disaster but the site is back up in 15 minutes” (hot standby with stale data)
Design for both, explicitly.
Backup types
Full backup
Copy everything, every time.
- Simple, slow, storage-heavy
- Restore is one step — just the latest full backup
Incremental backup
Copy only what has changed since the last backup of any kind.
- Fast, small per run
- Restore requires the last full + every incremental since
- Chain failures (one missing incremental = later ones unusable)
Differential backup
Copy everything that has changed since the last full backup.
- Between full and incremental in size/speed
- Restore requires only the last full + the latest differential — much simpler chain
- Grows over time until the next full
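The restore-chain differences between the three types can be made concrete. A sketch (the function and labels are illustrative; it assumes the runs after the last full are all of one type, as in a typical rotation) that resolves which backups a restore needs:

```python
def restore_chain(history):
    """history: list of ("full"|"incremental"|"differential", label), oldest first.
    Returns the labels needed to restore to the latest state."""
    if not history:
        return []
    last_kind, _ = history[-1]
    # Index of the most recent full backup.
    last_full = max(i for i, (kind, _) in enumerate(history) if kind == "full")
    if last_kind == "full":
        return [history[-1][1]]                       # one step: just the full
    if last_kind == "differential":
        return [history[last_full][1], history[-1][1]]  # last full + latest diff
    # Incremental: last full plus every incremental since.
    return [history[last_full][1]] + [
        label for kind, label in history[last_full + 1:] if kind == "incremental"
    ]

history = [("full", "sun"), ("incremental", "mon"), ("incremental", "tue")]
print(restore_chain(history))  # ['sun', 'mon', 'tue']
```

Note the chain-failure property: every label in the incremental result is required, so losing `mon` makes `tue` unusable, whereas the differential chain never needs more than two pieces.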
Typical rotation
A classic enterprise pattern:
- Full backup weekly (e.g., Sunday)
- Differential daily
- Incremental hourly (optional)
Modern practice (in cloud / snapshot-driven systems) is more often continuous incremental forever with synthetic fulls.
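A "synthetic full" is built by merging the last full with the incrementals since, entirely on the backup side. A toy sketch of that merge, assuming each backup is modelled as a mapping of path to content (real products work against block- or catalog-level data, not dicts):

```python
def synthetic_full(base_full: dict, incrementals: list[dict]) -> dict:
    """Merge a full backup with incremental change sets (oldest first) into a
    new full, without re-reading the production system. In this toy model a
    value of None marks a deletion."""
    merged = dict(base_full)
    for inc in incrementals:
        for path, content in inc.items():
            if content is None:
                merged.pop(path, None)   # file deleted since the last backup
            else:
                merged[path] = content   # file added or changed
    return merged

full = {"a.txt": "v1", "b.txt": "v1"}
incs = [{"a.txt": "v2"}, {"b.txt": None, "c.txt": "v1"}]
print(synthetic_full(full, incs))  # {'a.txt': 'v2', 'c.txt': 'v1'}
```

The payoff: the restore chain collapses back to a single full without the storage and I/O cost of re-copying everything from production.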
The 3-2-1 rule
The single most important rule in backups:
- 3 copies of your data
- 2 different media types
- 1 copy offsite
Why it works:
- 3 copies — the original plus two backups. One backup can silently corrupt; you still have another.
- 2 media types — e.g., disk + tape, or online + cloud. A bug that eats “all your disks” shouldn’t take your tapes with it.
- 1 offsite — fire, flood, theft, ransomware encrypting your entire on-prem network.
Modern variants:
- 3-2-1-1-0 — adds 1 immutable/air-gapped copy and 0 errors on verification.
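The rule is mechanical enough to check in code. A sketch of a compliance check over an inventory of copies (the `Copy` type and media names are invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class Copy:
    media: str          # e.g. "disk", "tape", "cloud-object"
    offsite: bool
    immutable: bool = False

def satisfies_3_2_1(copies: list[Copy]) -> bool:
    """True if the inventory (original included) meets 3-2-1:
    3+ copies, 2+ media types, 1+ copy offsite."""
    return (
        len(copies) >= 3
        and len({c.media for c in copies}) >= 2
        and any(c.offsite for c in copies)
    )

inventory = [
    Copy("disk", offsite=False),                         # production original
    Copy("disk", offsite=False),                         # local backup
    Copy("cloud-object", offsite=True, immutable=True),  # offsite, locked
]
print(satisfies_3_2_1(inventory))  # True
```

Extending the check to 3-2-1-1-0 would mean also requiring `any(c.immutable for c in copies)` plus a clean verification run.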
Snapshots vs backups
These get confused constantly:
| | Snapshot | Backup |
|---|---|---|
| Where | Same storage system | Separate system |
| Mechanism | Copy-on-write metadata | Copy of data to another medium |
| Speed | Instant | Slow (scales with data size) |
| Protects against | Fat-finger delete, recent corruption | Site loss, storage failure, ransomware |
| Survives if storage array dies | No | Yes |
| Counts toward “3-2-1” | No | Yes |
A snapshot is a useful first recovery tier — seconds to restore from, near-zero RPO. But if the underlying storage is lost, all snapshots go with it. You still need real backups.
Replication vs backup
Replication keeps a remote copy of current data — useful for DR failover and low RTO. But it’s not a backup:
- Deletions replicate. Corruption replicates. Ransomware replicates.
- You need backups that are point-in-time and immutable to survive these.
Common pattern: synchronous replication within a metro area (low RPO/RTO for site failure) + traditional backup for point-in-time recovery.
Testing is the whole game
An untested backup is not a backup — it’s a belief. Countless organisations have discovered only mid-crisis that their backups were broken.
- Test restores regularly. Quarterly is a minimum; monthly is better.
- Test at scale. A 1 GB file restore doesn’t prove a 2 TB database will restore within RTO.
- Document the runbook. “It worked when Dave did it” is not a recovery plan.
- DR drills — full failover to the DR site. Painful. Necessary.
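The "0 errors on verification" part of testing can be automated with checksums. A minimal sketch, assuming you record a checksum manifest at backup time and compare it against the restored files (names and the in-memory file model are illustrative):

```python
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def verify_restore(manifest: dict[str, str], restored: dict[str, bytes]) -> list[str]:
    """Compare a restored file set against checksums recorded at backup time.
    Returns the paths that are missing or corrupt; an empty list means the
    restore test passed."""
    failures = []
    for path, expected in manifest.items():
        data = restored.get(path)
        if data is None or checksum(data) != expected:
            failures.append(path)
    return failures

manifest = {"db.dump": checksum(b"payload")}
print(verify_restore(manifest, {"db.dump": b"payload"}))  # [] — verified
print(verify_restore(manifest, {"db.dump": b"corrupt"}))  # ['db.dump']
```

A check like this belongs in the scheduled restore test, not just in the backup job — it proves the copy you would actually restore from is intact.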
Ransomware-era considerations
- Immutable storage — once written, can’t be modified or deleted for a retention period. Object Lock (S3), WORM tapes, vendor immutability features.
- Air gap — physical or logical isolation so the backup system is unreachable from compromised production.
- Credentials separation — backup admin credentials must not be harvestable from production systems.
- Test with bad assumptions — assume production is fully compromised; can you still restore?
The “tiering” cheat sheet
| Tier | RPO | RTO | Mechanism |
|---|---|---|---|
| 0 — synchronous replication | ~0 | seconds | Stretched cluster, sync replication |
| 1 — async replication + hot standby | seconds to minutes | minutes | Replicated array, pilot-light VMs |
| 2 — warm standby + snapshots | minutes to hours | hours | Snapshot replication, spin-up on demand |
| 3 — backups | hours to days | hours to days | Traditional backup restore |
| 4 — cold / offsite / tape | days | days to weeks | Tape archives, offsite retrieval |
Most applications get a mix — the critical subset at tier 0 or 1, everything else at tier 2 or 3.
See also
- RAID Levels — RAID protects against drive failure, backups protect everything else
- SAN vs NAS
- High Availability
- 🖥️ Server Infrastructure MOC