# Replication & Fault Tolerance

## Why This Matters
Replication is how you achieve reliability, durability, and read scalability. Fault tolerance is what keeps systems running when things break — and things ALWAYS break at scale.
## Replication Strategies

### Single-Leader (Primary-Secondary)

    All Writes → Primary → Replication Log → Secondary 1
                                           → Secondary 2
                                           → Secondary 3

    Reads → Any node (potentially stale from secondaries)
**Write-Ahead Log (WAL) Shipping:**
- Primary writes to the WAL first, then applies the change
- Ships the WAL to secondaries, which replay it
- Used by: PostgreSQL (streaming replication), Oracle Data Guard
**Statement-Based Replication:**
- Primary sends the SQL statements themselves to secondaries
- Problem: non-deterministic functions (NOW(), RAND()) evaluate differently on each replica
- Largely superseded; MySQL now defaults to row-based replication
**Row-Based (Logical) Replication:**
- Ships actual data changes (insert row X, update column Y)
- Deterministic, used by most modern systems
- Enables Change Data Capture (CDC)
### Multi-Leader

    Client (US) → Leader US ↔ Leader EU ← Client (EU)
                      ↕            ↕
                 Follower US   Follower EU
**Use cases:**
- Multi-datacenter operation (write locally, replicate globally)
- Offline clients (each device is a "leader")
- Collaborative editing
**Conflict handling:**
- Avoid: route all writes for the same record to the same leader
- Detect: version vectors, timestamps
- Resolve: last-write-wins (LWW), merge the values, or apply custom application logic
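The "detect" step can be sketched with version vectors. A minimal comparison, where replica IDs and counters are illustrative: two versions conflict exactly when neither vector dominates the other.

```python
# Minimal version-vector comparison: each write bumps the writer's counter;
# a conflict exists iff neither vector dominates the other.

def compare(vv_a: dict, vv_b: dict) -> str:
    """Return 'equal', 'a<=b', 'b<=a', or 'concurrent'."""
    replicas = set(vv_a) | set(vv_b)
    a_le_b = all(vv_a.get(r, 0) <= vv_b.get(r, 0) for r in replicas)
    b_le_a = all(vv_b.get(r, 0) <= vv_a.get(r, 0) for r in replicas)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "a<=b"        # b has seen all of a's writes: no conflict, b wins
    if b_le_a:
        return "b<=a"        # a has seen all of b's writes: no conflict, a wins
    return "concurrent"      # neither dominates: a genuine conflict to resolve

# Illustrative: Leader US made two local writes; Leader EU saw only the first
# plus one write of its own, so the two versions are concurrent.
print(compare({"us": 2}, {"us": 1}))           # b<=a
print(compare({"us": 2}, {"us": 1, "eu": 1}))  # concurrent
```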
### Leaderless (Dynamo-Style)

    Client → Write to W nodes
    Client → Read from R nodes → Resolve conflicts

Quorum condition: **W + R > N**, where N is the replica count, W the write acknowledgements required, and R the read responses required.
- N=3, W=2, R=2 → read and write sets always overlap (reads see the latest acknowledged write)
- N=3, W=1, R=1 → no guaranteed overlap (fast, but only eventually consistent)
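The overlap guarantee is a pigeonhole argument, and for small N it can be checked exhaustively. A brute-force sketch (function name illustrative):

```python
# Check that every W-node write set and R-node read set share at least one
# node when W + R > N: W + R nodes drawn from only N must collide.
from itertools import combinations

def quorums_always_overlap(n: int, w: int, r: int) -> bool:
    nodes = range(n)
    return all(set(write_set) & set(read_set)
               for write_set in combinations(nodes, w)
               for read_set in combinations(nodes, r))

print(quorums_always_overlap(3, 2, 2))  # True:  W + R = 4 > 3
print(quorums_always_overlap(3, 1, 1))  # False: a read can miss the write
```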
**Anti-entropy mechanisms:**
- Read repair: On read, detect stale replica, update it
- Merkle trees: Compare data fingerprints to find inconsistencies
- Hinted handoff: If target node is down, another stores hint, delivers later
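The fingerprint idea behind Merkle-tree comparison can be sketched with a single level of hashes (bucket names and data are illustrative; a real Merkle tree nests these hashes so replicas can drill down log-style):

```python
# Compare per-range hash fingerprints so two replicas can find which key
# ranges diverged without shipping the data itself.
import hashlib

def fingerprint(bucket: dict) -> str:
    h = hashlib.sha256()
    for key in sorted(bucket):                     # sort for determinism
        h.update(f"{key}={bucket[key]};".encode())
    return h.hexdigest()

def diff_ranges(replica_a: dict, replica_b: dict) -> list:
    """Each replica maps range_name -> {key: value}; compare range hashes."""
    return [r for r in replica_a
            if fingerprint(replica_a[r]) != fingerprint(replica_b.get(r, {}))]

a = {"a-m": {"apple": 1, "mango": 2}, "n-z": {"pear": 3}}
b = {"a-m": {"apple": 1, "mango": 2}, "n-z": {"pear": 4}}  # stale value
print(diff_ranges(a, b))  # ['n-z']: only this range needs repair
```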
## Failure Modes

### Types of Failures
| Type | Description | Example |
|---|---|---|
| Crash failure | Node stops completely | Server power failure |
| Omission failure | Node fails to send/receive | Network packet loss |
| Timing failure | Node responds too slowly | GC pause, disk I/O |
| Byzantine failure | Node behaves arbitrarily/maliciously | Bug, hacked node |
### Failure Detection

Challenge: from the outside, a crashed node and a merely slow node look identical.

**Heartbeat-based detection:**

    Node A → Heartbeat every 1s → Monitor
    If no heartbeat for 5s  → Suspect failure
    If no heartbeat for 10s → Declare failed
Problems with timeout-based detection:
- Too short → false positives (a network blip marks a healthy node as dead)
- Too long → slow failure detection

The Phi Accrual Failure Detector (used by Cassandra) addresses this with adaptive thresholds: it computes a suspicion level from the history of heartbeat inter-arrival times rather than using a fixed timeout.
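The fixed 1s/5s/10s scheme above can be sketched as follows (class name and constants illustrative; a phi accrual detector would replace the fixed windows with an adaptive suspicion score):

```python
# Minimal timeout-based failure detector with explicit suspect/fail windows.

SUSPECT_AFTER = 5.0   # seconds of silence before suspecting a node
FAILED_AFTER = 10.0   # seconds of silence before declaring it failed

class HeartbeatMonitor:
    def __init__(self):
        self.last_seen = {}  # node -> timestamp of its last heartbeat

    def heartbeat(self, node: str, now: float):
        self.last_seen[node] = now

    def status(self, node: str, now: float) -> str:
        silence = now - self.last_seen.get(node, float("-inf"))
        if silence >= FAILED_AFTER:
            return "failed"
        if silence >= SUSPECT_AFTER:
            return "suspect"
        return "alive"

m = HeartbeatMonitor()
m.heartbeat("node-a", now=0.0)
print(m.status("node-a", now=3.0))   # alive
print(m.status("node-a", now=6.0))   # suspect
print(m.status("node-a", now=12.0))  # failed
```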
## Fault Tolerance Patterns

### Redundancy Levels
| Level | Description | Example |
|---|---|---|
| Active-Passive | Standby takes over on failure | Database failover |
| Active-Active | All nodes serve traffic | Multi-leader replication |
| N+1 | One spare for N active | 3 servers + 1 standby |
| N+2 | Two spares (survives 2 failures) | Critical financial systems |
### Circuit Breaker

States:

    CLOSED    → normal operation, track failure count
                → failure threshold exceeded →
    OPEN      → all requests fail fast (don't hit the failing service)
                → timeout expires →
    HALF-OPEN → allow one test request
                → success → CLOSED
                → failure → OPEN
Implementation: track the failure rate over a sliding window. Libraries: Hystrix (Netflix, now in maintenance mode), Resilience4j, Polly.
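A minimal sketch of this state machine, using a consecutive-failure counter rather than the sliding-window failure rate the libraries use; the thresholds and the injected clock are illustrative:

```python
# Circuit breaker implementing the CLOSED / OPEN / HALF-OPEN transitions.

class CircuitBreaker:
    def __init__(self, threshold=3, reset_timeout=30.0):
        self.threshold = threshold          # consecutive failures to trip
        self.reset_timeout = reset_timeout  # seconds OPEN before a test call
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, now):
        if self.state == "OPEN":
            if now - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.state = "HALF-OPEN"        # timeout expired: one test request
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "HALF-OPEN" or self.failures >= self.threshold:
                self.state = "OPEN"
                self.opened_at = now
            raise
        self.failures = 0                   # success resets the breaker
        self.state = "CLOSED"
        return result

cb = CircuitBreaker(threshold=2, reset_timeout=30.0)
def boom(): raise IOError("service down")
for t in (0.0, 1.0):
    try: cb.call(boom, now=t)
    except IOError: pass
print(cb.state)                  # OPEN: further calls fail fast
cb.call(lambda: "ok", now=45.0)  # timeout expired -> HALF-OPEN -> success
print(cb.state)                  # CLOSED
```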
### Bulkhead

Isolate failures to prevent cascades:

    Thread Pool A: User Service    (10 threads)
    Thread Pool B: Payment Service (10 threads)
    Thread Pool C: Search Service  (10 threads)

If the Payment Service hangs, only its 10 threads block; User Service and Search still work.
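A sketch with one bounded thread pool per dependency (pool sizes shrunk and the hang simulated with a short sleep, all illustrative):

```python
# Bulkhead: each downstream service gets its own bounded thread pool, so a
# hung dependency can only exhaust its own workers.
from concurrent.futures import ThreadPoolExecutor
import time

pools = {
    "user":    ThreadPoolExecutor(max_workers=2),
    "payment": ThreadPoolExecutor(max_workers=2),
    "search":  ThreadPoolExecutor(max_workers=2),
}

def hung_call():      # stands in for a Payment Service that stops responding
    time.sleep(2)

def fast_call():
    return "ok"

# Saturate the payment pool: both workers block, extra requests just queue...
for _ in range(4):
    pools["payment"].submit(hung_call)

# ...while the other bulkheads keep serving traffic.
user_result = pools["user"].submit(fast_call).result(timeout=1)
search_result = pools["search"].submit(fast_call).result(timeout=1)
print(user_result, search_result)  # ok ok
```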
### Retry with Exponential Backoff

    Attempt 1: immediate
    Attempt 2: wait 1s
    Attempt 3: wait 2s
    Attempt 4: wait 4s
    Attempt 5: wait 8s + jitter (random 0-1s)
Jitter is critical: Without it, all clients retry at the same time → thundering herd.
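The schedule above, with the additive 0-1s jitter, can be sketched as follows (base, cap, and function name are illustrative; "full jitter", drawing the entire delay uniformly from [0, backoff], is a common alternative):

```python
# Exponential backoff: the delay doubles per attempt, is capped, and gets
# random jitter so a fleet of clients doesn't retry in lockstep.
import random

def backoff_delays(attempts: int, base: float = 1.0, cap: float = 8.0,
                   rng=random.random):
    """Delays before attempts 2..attempts (attempt 1 is immediate)."""
    delays = []
    for i in range(1, attempts):
        exponential = min(cap, base * 2 ** (i - 1))  # 1, 2, 4, 8, 8, ...
        delays.append(exponential + rng())           # + jitter in [0, 1)
    return delays

# Fixed rng for a reproducible illustration:
print(backoff_delays(5, rng=lambda: 0.5))  # [1.5, 2.5, 4.5, 8.5]
```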
### Timeout
- Always set timeouts on external calls
- Connection timeout: how long to establish connection (short: 1-5s)
- Read timeout: how long to wait for response (depends on operation)
- No timeout = thread/connection leak risk
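The two timeouts can be sketched with the stdlib socket API; the values and the throwaway local echo server here are purely illustrative:

```python
# Separate connect timeout (TCP handshake) and read timeout (response wait),
# demoed against a one-shot local echo server.
import socket
import threading

def fetch(host: str, port: int, request: bytes,
          connect_timeout: float = 3.0, read_timeout: float = 10.0) -> bytes:
    # Connection timeout: bounds how long we wait to establish the connection.
    sock = socket.create_connection((host, port), timeout=connect_timeout)
    try:
        # Read timeout: bounds how long we wait for response bytes.
        sock.settimeout(read_timeout)
        sock.sendall(request)
        return sock.recv(4096)
    finally:
        sock.close()

# Throwaway echo server on an ephemeral port, just for the demo:
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)

def echo_once():
    conn, _ = server.accept()
    conn.sendall(conn.recv(4096))
    conn.close()

threading.Thread(target=echo_once, daemon=True).start()
host, port = server.getsockname()
response = fetch(host, port, b"ping")
print(response)  # b'ping'
```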
### Idempotency

Make operations safe to retry:

    POST /payments
    Idempotency-Key: "abc-123"

    First call: processes payment, returns 200
    Retry (same key): returns the same 200 response (no double charge)
Implementation: Store idempotency key + result. On retry, return stored result.
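A server-side sketch of the key + stored-result approach (class name illustrative, store is an in-memory dict; a real service would persist keys durably, with a TTL):

```python
# Idempotent payment endpoint: the first call with a key runs the side
# effect and stores the response; retries replay the stored response.
import uuid

class PaymentService:
    def __init__(self):
        self._results = {}   # idempotency key -> stored response
        self.charges = 0     # how many times we actually charged

    def post_payment(self, idempotency_key: str, amount: int) -> dict:
        if idempotency_key in self._results:
            return self._results[idempotency_key]   # replay, no side effect
        self.charges += 1                           # side effect runs once
        response = {"status": 200, "charge_id": str(uuid.uuid4()),
                    "amount": amount}
        self._results[idempotency_key] = response
        return response

svc = PaymentService()
first = svc.post_payment("abc-123", amount=500)
retry = svc.post_payment("abc-123", amount=500)  # e.g. the client timed out
print(retry == first, svc.charges)  # True 1: no double charge
```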
## Failover Strategies

### Planned Failover
- Graceful switchover for maintenance
- Drain connections from old primary → promote new primary
### Unplanned Failover
1. Detect failure (heartbeat timeout)
2. Choose new primary (most up-to-date replica)
3. Reconfigure clients to point to new primary
4. Old primary joins as replica when it recovers
### Failover Challenges
- Data loss: Async replication means some writes may not have reached replica
- Split brain: Old primary comes back, thinks it's still leader
- Cascading failures: Failover increases load on remaining nodes → they fail too
- Client redirection: Clients need to discover new primary (DNS update, service discovery)
## Chaos Engineering

### Philosophy
"If something can break, it will. Better to break it deliberately and learn."
### Netflix Chaos Monkey
- Randomly kills production instances
- Forces teams to build resilient systems
- Part of Netflix's "Simian Army"
### Chaos Engineering Principles
- Define steady state (normal behavior metrics)
- Hypothesize: "system will continue working if X fails"
- Inject failure (kill instance, introduce latency, corrupt network)
- Observe: did system maintain steady state?
- Fix weaknesses found
### Common Chaos Experiments
| Experiment | Tests |
|---|---|
| Kill random instance | Auto-recovery, health checks |
| Add network latency | Timeout handling, circuit breakers |
| Fill disk | Storage monitoring, cleanup |
| CPU stress | Auto-scaling, load balancing |
| Kill entire AZ | Multi-AZ resilience |
| DNS failure | Fallback, caching |
## Disaster Recovery

### RTO vs RPO
- RTO (Recovery Time Objective): Maximum acceptable downtime
- RPO (Recovery Point Objective): Maximum acceptable data loss
    |------ data loss (RPO) ------|-- downtime (RTO) --|
    Last backup               Disaster        Recovery
### DR Strategies (Increasing Cost & Speed)
| Strategy | RTO | RPO | Cost |
|---|---|---|---|
| Backup & Restore | Hours | Hours | $ |
| Pilot Light | Tens of minutes | Seconds | $$ |
| Warm Standby | Minutes | Seconds | $$$ |
| Multi-Site Active-Active | Near-zero | Near-zero | $$$$ |
## Resources
- 📖 DDIA Chapter 5: Replication
- 📖 DDIA Chapter 8: The Trouble with Distributed Systems
- 📖 "Release It!" by Michael Nygard (circuit breakers, bulkheads)
- 🔗 Netflix Chaos Engineering
- 🔗 AWS Well-Architected: Reliability Pillar
- 🎥 Martin Kleppmann — Distributed Systems lectures
Previous: 11 - Distributed Consensus | Next: 13 - Distributed Transactions