High Availability

Definition

Design for the assumption that components will fail. Measured as “nines” of uptime. Two complementary strategies: redundancy (more than one of everything) and fast detection + recovery.

Where it appears

🌐 Networking

  • VRRP / HSRP / GLBP — first-hop redundancy
  • Spanning Tree — L2 loop prevention with redundant links
  • ECMP / link aggregation — multi-path at L3/L2

🐧 Linux

  • Pacemaker + Corosync — cluster resource manager
  • Keepalived — VRRP + healthchecks
  • systemd restart policies

☁️ Cloud

  • Multi-AZ — availability zones per region
  • Multi-region — cross-region replication, DNS failover
  • ALB/NLB / Azure LB — health checks + target removal
  • Auto Scaling / VMSS — replace unhealthy instances

📦 Containers

  • Deployments — multiple replicas, rolling updates
  • PodDisruptionBudget — guarantee minimum available
  • Topology spread constraints — spread across zones

Patterns

  • Active/Active — all nodes serve traffic
  • Active/Passive — standby takes over on failover
  • N+1 vs N+M redundancy
  • Blast radius reduction — cells, shards, bulkheads

See also