High Availability
Definition
Design for the assumption that components will fail. Measured as “nines” of uptime. Two complementary strategies: redundancy (more than one of everything) and fast detection + recovery.
Where it appears
🌐 Networking
- VRRP / HSRP / GLBP — first-hop redundancy
- Spanning Tree — L2 loop prevention with redundant links
- ECMP / link aggregation — multi-path at L3/L2
🐧 Linux
- Pacemaker + Corosync — cluster resource manager
- Keepalived — VRRP + healthchecks
- systemd restart policies
☁️ Cloud
- Multi-AZ — availability zones per region
- Multi-region — cross-region replication, DNS failover
- ALB/NLB / Azure LB — health checks + target removal
- Auto Scaling / VMSS — replace unhealthy instances
📦 Containers
- Deployments — multiple replicas, rolling updates
- PodDisruptionBudget — guarantee minimum available
- Topology spread constraints — spread across zones
Patterns
- Active/Active — all nodes serve traffic
- Active/Passive — standby takes over on failover
- N+1 vs N+M redundancy
- Blast radius reduction — cells, shards, bulkheads