Backup Fundamentals — RPO and RTO

Two numbers answer most of the design questions in backup and disaster recovery: how much data can you afford to lose (RPO) and how long can you be down (RTO). Everything else follows from those.

The two numbers

RPO — Recovery Point Objective

“If I lose everything right now, how far back is the last recoverable state?”

RPO is measured in time. It’s the worst-case gap between the moment of failure and the last restorable copy. RPO = 1 hour means you can tolerate losing up to 1 hour of data.

  • RPO is driven by how often you back up. Backups every 24 h → RPO ≤ 24 h.
  • RPO = 0 requires synchronous replication — every write is committed to both sides before it’s acknowledged.
  • Low RPO costs money: more storage, more bandwidth, more complex replication.
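The relationship between backup times and data loss can be sketched in a few lines. This is an illustrative example, not from the text — the timestamps and the `achieved_rpo` helper are hypothetical:

```python
from datetime import datetime, timedelta

def achieved_rpo(backup_times: list[datetime], failure_time: datetime) -> timedelta:
    """Data lost if the failure happens at failure_time: the gap back to
    the most recent restorable backup taken before the failure."""
    usable = [t for t in backup_times if t <= failure_time]
    if not usable:
        raise ValueError("no restorable backup before the failure")
    return failure_time - max(usable)

# Backups every 6 hours; failure one minute before the 18:00 run.
backups = [datetime(2024, 1, 1, h) for h in (0, 6, 12, 18)]
loss = achieved_rpo(backups, datetime(2024, 1, 1, 17, 59))
# Last restorable copy is 12:00, so ~6 h of writes are gone —
# the worst case is always just under one backup interval.
```

The worst case matters because failures don't schedule themselves: design to the interval, not the average gap.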

RTO — Recovery Time Objective

“If we have to recover, how long until the service is running again?”

RTO is measured in time. The clock starts at the outage and stops when the service is usable again. RTO = 4 hours means you have 4 hours from “it broke” to “it works.”

  • RTO is driven by how fast you can restore. Pulling tapes from offsite → measured in days. Hot standby → measured in seconds.
  • Low RTO costs money: warm or hot standby systems, automated failover, more skilled operators on call.
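The RTO clock covers everything between “it broke” and “it works,” not just the data copy. A sketch of a stage-by-stage budget — the stage names and durations below are illustrative, not from the text:

```python
from datetime import timedelta

# Hypothetical timings for a tape-based restore; every stage counts
# toward RTO, not only the restore itself.
stages = {
    "detect outage":       timedelta(minutes=15),
    "decide to fail over": timedelta(minutes=30),
    "retrieve media":      timedelta(hours=4),
    "restore data":        timedelta(hours=6),
    "validate service":    timedelta(hours=1),
}

total = sum(stages.values(), timedelta())
rto_target = timedelta(hours=4)

# total is 11 h 45 min — this plan blows a 4-hour RTO before
# the data copy even finishes.
```

Budgeting per stage makes it obvious where to spend: here, offsite media retrieval alone consumes the whole target.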

The two are independent

You can have:

  • Low RPO + high RTO — “we have near-zero data loss, but it takes us two days to bring it back up” (typical of replicated backups you then need to restore onto new hardware)
  • High RPO + low RTO — “we lose a day of data in a disaster but the site is back up in 15 minutes” (hot standby with stale data)

Design for both, explicitly.

Backup types

Full backup

Copy everything, every time.

  • Simple, slow, storage-heavy
  • Restore is one step — just the latest full backup

Incremental backup

Copy only what has changed since the last backup of any kind.

  • Fast, small per run
  • Restore requires the last full + every incremental since
  • Chain failures (one missing incremental = later ones unusable)

Differential backup

Copy everything that has changed since the last full backup.

  • Between full and incremental in size/speed
  • Restore requires only the last full + the latest differential — much simpler chain
  • Grows over time until the next full
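The difference in restore chains is easiest to see in code. A sketch, assuming each backup record carries a type and a timestamp (the record shape is illustrative):

```python
def incremental_chain(backups):
    """Last full + every incremental since it, oldest first.
    One missing incremental breaks everything after it."""
    last_full = max((b for b in backups if b["type"] == "full"),
                    key=lambda b: b["t"])
    incrs = [b for b in backups
             if b["type"] == "incr" and b["t"] > last_full["t"]]
    return [last_full] + sorted(incrs, key=lambda b: b["t"])

def differential_chain(backups):
    """Last full + only the latest differential — a two-step chain,
    no matter how many differentials have run."""
    last_full = max((b for b in backups if b["type"] == "full"),
                    key=lambda b: b["t"])
    diffs = [b for b in backups
             if b["type"] == "diff" and b["t"] > last_full["t"]]
    return [last_full, max(diffs, key=lambda b: b["t"])]
```

Note that the incremental chain grows with every run, while the differential chain stays at length two — that is the whole trade against the differential's growing size.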

Typical rotation

A classic enterprise pattern:

  • Full backup weekly (e.g., Sunday)
  • Differential daily
  • Incremental hourly (optional)

Modern practice (in cloud / snapshot-driven systems) is more often continuous incremental forever with synthetic fulls.
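The classic rotation above can be written as a small scheduling rule. A sketch — the Sunday-midnight convention follows the example in the text, the function name is hypothetical:

```python
from datetime import datetime

def backup_type(now: datetime) -> str:
    """Classic rotation: weekly full on Sunday at midnight,
    daily differential at midnight otherwise, hourly incremental
    in between."""
    if now.weekday() == 6 and now.hour == 0:  # Sunday 00:00
        return "full"
    if now.hour == 0:                          # other midnights
        return "differential"
    return "incremental"
```

A continuous-incremental-forever scheme replaces this rule entirely: every run is incremental, and “fulls” are synthesized server-side from the chain.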

The 3-2-1 rule

The single most important rule in backups:

  • 3 copies of your data
  • 2 different media types
  • 1 copy offsite

Why it works:

  • 3 copies — the original plus two backups. One backup can silently corrupt; you still have another.
  • 2 media types — e.g., disk + tape, or online + cloud. A bug that eats “all your disks” shouldn’t take your tapes with it.
  • 1 offsite — fire, flood, theft, ransomware encrypting your entire on-prem network.

Modern variants:

  • 3-2-1-1-0 — adds 1 immutable/air-gapped copy and 0 errors on verification.
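The base rule is mechanical enough to check automatically. A sketch of a 3-2-1 compliance check — the copy records and field names are illustrative:

```python
def satisfies_321(copies) -> bool:
    """3 copies total, on at least 2 media types, at least 1 offsite."""
    media = {c["media"] for c in copies}
    offsite = [c for c in copies if c["offsite"]]
    return len(copies) >= 3 and len(media) >= 2 and len(offsite) >= 1

copies = [
    {"media": "disk",  "offsite": False},  # production data
    {"media": "disk",  "offsite": False},  # local backup
    {"media": "cloud", "offsite": True},   # offsite copy
]
# satisfies_321(copies) -> True; drop the cloud copy and it fails
# on all three counts at once.
```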

Snapshots vs backups

These get confused constantly:

                                 Snapshot                  Backup
  Where                          Same storage system       Separate system
  Mechanism                      Copy-on-write metadata    Copy of data to another medium
  Speed                          Instant                   Slow (scales with data size)
  Protects against               Fat-finger delete,        Site loss, storage failure,
                                 recent corruption         ransomware
  Survives if storage array dies No                        Yes
  Counts toward “3-2-1”          No                        Yes

A snapshot is a useful first recovery tier — seconds to restore from, near-zero RPO. But if the underlying storage is lost, all snapshots go with it. You still need real backups.


Replication vs backup

Replication keeps a remote copy of current data — useful for DR failover and low RTO. But it’s not a backup:

  • Deletions replicate. Corruption replicates. Ransomware replicates.
  • You need backups that are point-in-time and immutable to survive these.

Common pattern: synchronous replication within a metro area (low RPO/RTO for site failure) + traditional backup for point-in-time recovery.

Testing is the whole game

An untested backup is not a backup — it’s a belief. Countless organisations have discovered, mid-crisis, that their backups had been silently broken for months.

  • Test restores regularly. Quarterly is a minimum; monthly is better.
  • Test at scale. A 1 GB file restore doesn’t prove a 2 TB database will restore within RTO.
  • Document the runbook. “It worked when Dave did it” is not a recovery plan.
  • DR drills — full failover to the DR site. Painful. Necessary.
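A restore test has to prove two things: the restored data matches the original, and the restore finished inside RTO. A minimal sketch using stdlib checksums — the paths and timings passed in are illustrative:

```python
import hashlib
from pathlib import Path

def sha256(path: Path) -> str:
    """Stream the file in 1 MiB chunks so large restores don't need RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_restore(original: Path, restored: Path,
                   elapsed_s: float, rto_s: float) -> bool:
    """Pass only if the bytes match AND the restore beat the RTO clock."""
    return sha256(original) == sha256(restored) and elapsed_s <= rto_s
```

Run this against production-sized data, not a sample file — a restore that passes the checksum but takes three times your RTO is still a failed test.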

Ransomware-era considerations

  • Immutable storage — once written, can’t be modified or deleted for a retention period. Object Lock (S3), WORM tapes, vendor immutability features.
  • Air gap — physical or logical isolation so the backup system is unreachable from compromised production.
  • Credentials separation — backup admin credentials must not be harvestable from production systems.
  • Test with bad assumptions — assume production is fully compromised; can you still restore?

The “tiering” cheat sheet

  Tier                                  RPO                  RTO             Mechanism
  0 — synchronous replication           ~0                   seconds         Stretched cluster, sync replication
  1 — async replication + hot standby   seconds to minutes   minutes         Replicated array, pilot-light VMs
  2 — warm standby + snapshots          minutes to hours     hours           Snapshot replication, spin-up on demand
  3 — backups                           hours to days        hours to days   Traditional backup restore
  4 — cold / offsite / tape             days                 days to weeks   Tape archives, offsite retrieval

Most applications get a mix — the critical subset at tier 0 or 1, everything else at tier 2 or 3.

See also