# Replication & Fault Tolerance

## Why This Matters
Replication is how you achieve reliability, durability, and read scalability. Fault tolerance is what keeps systems running when things break — and things ALWAYS break at scale.
## Replication Strategies

### Single-Leader (Primary-Secondary)

    All Writes → Primary → Replication Log → Secondary 1
                                           → Secondary 2
                                           → Secondary 3

    Reads → Any node (potentially stale from secondaries)
**Write-Ahead Log (WAL) Shipping:**
- Primary writes to the WAL first, then applies the change
- Ships the WAL to secondaries, which replay it
- Used by: PostgreSQL (streaming replication), Oracle Data Guard
**Statement-Based Replication:**
- Primary sends the SQL statements themselves to secondaries
- Problem: non-deterministic functions (NOW(), RAND()) evaluate differently on each replica
- Largely superseded; MySQL now defaults to row-based replication
**Row-Based (Logical) Replication:**
- Ships actual data changes (insert row X, update column Y)
- Deterministic, used by most modern systems
- Enables Change Data Capture (CDC)
### Multi-Leader

    Client (US) → Leader US ↔ Leader EU ← Client (EU)
                      ↕            ↕
                 Follower US   Follower EU
**Use cases:**
- Multi-datacenter operation (write locally, replicate globally)
- Offline clients (each device is a "leader")
- Collaborative editing
**Conflict handling:**
- Avoid: route all writes for the same record to the same leader
- Detect: version vectors, timestamps
- Resolve: last-write-wins (LWW), merge the values, or apply custom application logic
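The "detect" step can be sketched with version vectors. A minimal comparison, where replica IDs and counters are illustrative: two versions conflict exactly when neither vector dominates the other.

```python
# Minimal version-vector comparison: each write bumps the writer's counter;
# a conflict exists iff neither vector dominates the other.

def compare(vv_a: dict, vv_b: dict) -> str:
    """Return 'equal', 'a<=b', 'b<=a', or 'concurrent'."""
    replicas = set(vv_a) | set(vv_b)
    a_le_b = all(vv_a.get(r, 0) <= vv_b.get(r, 0) for r in replicas)
    b_le_a = all(vv_b.get(r, 0) <= vv_a.get(r, 0) for r in replicas)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "a<=b"        # b has seen all of a's writes: no conflict, b wins
    if b_le_a:
        return "b<=a"        # a has seen all of b's writes: no conflict, a wins
    return "concurrent"      # neither dominates: a genuine conflict to resolve

# Illustrative: Leader US made two local writes; Leader EU saw only the first
# plus one write of its own, so the two versions are concurrent.
print(compare({"us": 2}, {"us": 1}))           # b<=a
print(compare({"us": 2}, {"us": 1, "eu": 1}))  # concurrent
```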
### Leaderless (Dynamo-Style)

    Client → Write to W nodes
    Client → Read from R nodes → Resolve conflicts

Quorum condition: **W + R > N**, where N is the replica count, W the write acknowledgements required, and R the read responses required.
- N=3, W=2, R=2 → read and write sets always overlap (reads see the latest acknowledged write)
- N=3, W=1, R=1 → no guaranteed overlap (fast, but only eventually consistent)
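The overlap guarantee is a pigeonhole argument, and for small N it can be checked exhaustively. A brute-force sketch (function name illustrative):

```python
# Check that every W-node write set and R-node read set share at least one
# node when W + R > N: W + R nodes drawn from only N must collide.
from itertools import combinations

def quorums_always_overlap(n: int, w: int, r: int) -> bool:
    nodes = range(n)
    return all(set(write_set) & set(read_set)
               for write_set in combinations(nodes, w)
               for read_set in combinations(nodes, r))

print(quorums_always_overlap(3, 2, 2))  # True:  W + R = 4 > 3
print(quorums_always_overlap(3, 1, 1))  # False: a read can miss the write
```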
**Anti-entropy mechanisms:**
- Read repair: On read, detect stale replica, update it
- Merkle trees: Compare data fingerprints to find inconsistencies
- Hinted handoff: If target node is down, another stores hint, delivers later
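The fingerprint idea behind Merkle-tree comparison can be sketched with a single level of hashes (bucket names and data are illustrative; a real Merkle tree nests these hashes so replicas can drill down log-style):

```python
# Compare per-range hash fingerprints so two replicas can find which key
# ranges diverged without shipping the data itself.
import hashlib

def fingerprint(bucket: dict) -> str:
    h = hashlib.sha256()
    for key in sorted(bucket):                     # sort for determinism
        h.update(f"{key}={bucket[key]};".encode())
    return h.hexdigest()

def diff_ranges(replica_a: dict, replica_b: dict) -> list:
    """Each replica maps range_name -> {key: value}; compare range hashes."""
    return [r for r in replica_a
            if fingerprint(replica_a[r]) != fingerprint(replica_b.get(r, {}))]

a = {"a-m": {"apple": 1, "mango": 2}, "n-z": {"pear": 3}}
b = {"a-m": {"apple": 1, "mango": 2}, "n-z": {"pear": 4}}  # stale value
print(diff_ranges(a, b))  # ['n-z']: only this range needs repair
```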
## Failure Modes

### Types of Failures
| Type | Description | Example |
|---|---|---|
| Crash failure | Node stops completely | Server power failure |
| Omission failure | Node fails to send/receive | Network packet loss |
| Timing failure | Node responds too slowly | GC pause, disk I/O |
| Byzantine failure | Node behaves arbitrarily/maliciously | Bug, hacked node |
### Failure Detection

Challenge: from the outside, a crashed node and a merely slow node look identical.

**Heartbeat-based detection:**

    Node A → Heartbeat every 1s → Monitor
    If no heartbeat for 5s  → Suspect failure
    If no heartbeat for 10s → Declare failed
Problems with timeout-based detection:
- Too short → false positives (a network blip marks a healthy node as dead)
- Too long → slow failure detection

The Phi Accrual Failure Detector (used by Cassandra) addresses this with adaptive thresholds: it computes a suspicion level from the history of heartbeat inter-arrival times rather than using a fixed timeout.
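The fixed 1s/5s/10s scheme above can be sketched as follows (class name and constants illustrative; a phi accrual detector would replace the fixed windows with an adaptive suspicion score):

```python
# Minimal timeout-based failure detector with explicit suspect/fail windows.

SUSPECT_AFTER = 5.0   # seconds of silence before suspecting a node
FAILED_AFTER = 10.0   # seconds of silence before declaring it failed

class HeartbeatMonitor:
    def __init__(self):
        self.last_seen = {}  # node -> timestamp of its last heartbeat

    def heartbeat(self, node: str, now: float):
        self.last_seen[node] = now

    def status(self, node: str, now: float) -> str:
        silence = now - self.last_seen.get(node, float("-inf"))
        if silence >= FAILED_AFTER:
            return "failed"
        if silence >= SUSPECT_AFTER:
            return "suspect"
        return "alive"

m = HeartbeatMonitor()
m.heartbeat("node-a", now=0.0)
print(m.status("node-a", now=3.0))   # alive
print(m.status("node-a", now=6.0))   # suspect
print(m.status("node-a", now=12.0))  # failed
```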
## Fault Tolerance Patterns

### Redundancy Levels
| Level | Description | Example |
|---|---|---|
| Active-Passive | Standby takes over on failure | Database failover |
| Active-Active | All nodes serve traffic | Multi-leader replication |
| N+1 | One spare for N active | 3 servers + 1 standby |
| N+2 | Two spares (survives 2 failures) | Critical financial systems |
### Circuit Breaker

States:

    CLOSED    → normal operation, track failure count
                → failure threshold exceeded →
    OPEN      → all requests fail fast (don't hit the failing service)
                → timeout expires →
    HALF-OPEN → allow one test request
                → success → CLOSED
                → failure → OPEN
Implementation: track the failure rate over a sliding window. Libraries: Hystrix (Netflix, now in maintenance mode), Resilience4j, Polly.
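A minimal sketch of this state machine, using a consecutive-failure counter rather than the sliding-window failure rate the libraries use; the thresholds and the injected clock are illustrative:

```python
# Circuit breaker implementing the CLOSED / OPEN / HALF-OPEN transitions.

class CircuitBreaker:
    def __init__(self, threshold=3, reset_timeout=30.0):
        self.threshold = threshold          # consecutive failures to trip
        self.reset_timeout = reset_timeout  # seconds OPEN before a test call
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, now):
        if self.state == "OPEN":
            if now - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.state = "HALF-OPEN"        # timeout expired: one test request
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "HALF-OPEN" or self.failures >= self.threshold:
                self.state = "OPEN"
                self.opened_at = now
            raise
        self.failures = 0                   # success resets the breaker
        self.state = "CLOSED"
        return result

cb = CircuitBreaker(threshold=2, reset_timeout=30.0)
def boom(): raise IOError("service down")
for t in (0.0, 1.0):
    try: cb.call(boom, now=t)
    except IOError: pass
print(cb.state)                  # OPEN: further calls fail fast
cb.call(lambda: "ok", now=45.0)  # timeout expired -> HALF-OPEN -> success
print(cb.state)                  # CLOSED
```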
### Bulkhead

Isolate failures to prevent cascades:

    Thread Pool A: User Service    (10 threads)
    Thread Pool B: Payment Service (10 threads)
    Thread Pool C: Search Service  (10 threads)

If the Payment Service hangs, only its 10 threads block; User Service and Search still work.
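A sketch with one bounded thread pool per dependency (pool sizes shrunk and the hang simulated with a short sleep, all illustrative):

```python
# Bulkhead: each downstream service gets its own bounded thread pool, so a
# hung dependency can only exhaust its own workers.
from concurrent.futures import ThreadPoolExecutor
import time

pools = {
    "user":    ThreadPoolExecutor(max_workers=2),
    "payment": ThreadPoolExecutor(max_workers=2),
    "search":  ThreadPoolExecutor(max_workers=2),
}

def hung_call():      # stands in for a Payment Service that stops responding
    time.sleep(2)

def fast_call():
    return "ok"

# Saturate the payment pool: both workers block, extra requests just queue...
for _ in range(4):
    pools["payment"].submit(hung_call)

# ...while the other bulkheads keep serving traffic.
user_result = pools["user"].submit(fast_call).result(timeout=1)
search_result = pools["search"].submit(fast_call).result(timeout=1)
print(user_result, search_result)  # ok ok
```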
### Retry with Exponential Backoff

    Attempt 1: immediate
    Attempt 2: wait 1s
    Attempt 3: wait 2s
    Attempt 4: wait 4s
    Attempt 5: wait 8s + jitter (random 0-1s)
Jitter is critical: Without it, all clients retry at the same time → thundering herd.
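The schedule above, with the additive 0-1s jitter, can be sketched as follows (base, cap, and function name are illustrative; "full jitter", drawing the entire delay uniformly from [0, backoff], is a common alternative):

```python
# Exponential backoff: the delay doubles per attempt, is capped, and gets
# random jitter so a fleet of clients doesn't retry in lockstep.
import random

def backoff_delays(attempts: int, base: float = 1.0, cap: float = 8.0,
                   rng=random.random):
    """Delays before attempts 2..attempts (attempt 1 is immediate)."""
    delays = []
    for i in range(1, attempts):
        exponential = min(cap, base * 2 ** (i - 1))  # 1, 2, 4, 8, 8, ...
        delays.append(exponential + rng())           # + jitter in [0, 1)
    return delays

# Fixed rng for a reproducible illustration:
print(backoff_delays(5, rng=lambda: 0.5))  # [1.5, 2.5, 4.5, 8.5]
```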
### Timeout
- Always set timeouts on external calls
- Connection timeout: how long to establish connection (short: 1-5s)
- Read timeout: how long to wait for response (depends on operation)
- No timeout = thread/connection leak risk
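The two timeouts can be sketched with the stdlib socket API; the values and the throwaway local echo server here are purely illustrative:

```python
# Separate connect timeout (TCP handshake) and read timeout (response wait),
# demoed against a one-shot local echo server.
import socket
import threading

def fetch(host: str, port: int, request: bytes,
          connect_timeout: float = 3.0, read_timeout: float = 10.0) -> bytes:
    # Connection timeout: bounds how long we wait to establish the connection.
    sock = socket.create_connection((host, port), timeout=connect_timeout)
    try:
        # Read timeout: bounds how long we wait for response bytes.
        sock.settimeout(read_timeout)
        sock.sendall(request)
        return sock.recv(4096)
    finally:
        sock.close()

# Throwaway echo server on an ephemeral port, just for the demo:
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)

def echo_once():
    conn, _ = server.accept()
    conn.sendall(conn.recv(4096))
    conn.close()

threading.Thread(target=echo_once, daemon=True).start()
host, port = server.getsockname()
response = fetch(host, port, b"ping")
print(response)  # b'ping'
```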
### Idempotency

Make operations safe to retry:

    POST /payments
    Idempotency-Key: "abc-123"

    First call: processes payment, returns 200
    Retry (same key): returns the same 200 response (no double charge)
Implementation: Store idempotency key + result. On retry, return stored result.
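A server-side sketch of the key + stored-result approach (class name illustrative, store is an in-memory dict; a real service would persist keys durably, with a TTL):

```python
# Idempotent payment endpoint: the first call with a key runs the side
# effect and stores the response; retries replay the stored response.
import uuid

class PaymentService:
    def __init__(self):
        self._results = {}   # idempotency key -> stored response
        self.charges = 0     # how many times we actually charged

    def post_payment(self, idempotency_key: str, amount: int) -> dict:
        if idempotency_key in self._results:
            return self._results[idempotency_key]   # replay, no side effect
        self.charges += 1                           # side effect runs once
        response = {"status": 200, "charge_id": str(uuid.uuid4()),
                    "amount": amount}
        self._results[idempotency_key] = response
        return response

svc = PaymentService()
first = svc.post_payment("abc-123", amount=500)
retry = svc.post_payment("abc-123", amount=500)  # e.g. the client timed out
print(retry == first, svc.charges)  # True 1: no double charge
```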
## Failover Strategies

### Planned Failover
- Graceful switchover for maintenance
- Drain connections from old primary → promote new primary
### Unplanned Failover
1. Detect failure (heartbeat timeout)
2. Choose new primary (most up-to-date replica)
3. Reconfigure clients to point to new primary
4. Old primary joins as replica when it recovers
### Failover Challenges
- Data loss: Async replication means some writes may not have reached replica
- Split brain: Old primary comes back, thinks it's still leader
- Cascading failures: Failover increases load on remaining nodes → they fail too
- Client redirection: Clients need to discover new primary (DNS update, service discovery)
## Chaos Engineering

### Philosophy
"If something can break, it will. Better to break it deliberately and learn."
### Netflix Chaos Monkey
- Randomly kills production instances
- Forces teams to build resilient systems
- Part of Netflix's "Simian Army"
### Chaos Engineering Principles
- Define steady state (normal behavior metrics)
- Hypothesize: "system will continue working if X fails"
- Inject failure (kill instance, introduce latency, corrupt network)
- Observe: did system maintain steady state?
- Fix weaknesses found
### Common Chaos Experiments
| Experiment | Tests |
|---|---|
| Kill random instance | Auto-recovery, health checks |
| Add network latency | Timeout handling, circuit breakers |
| Fill disk | Storage monitoring, cleanup |
| CPU stress | Auto-scaling, load balancing |
| Kill entire AZ | Multi-AZ resilience |
| DNS failure | Fallback, caching |
## Disaster Recovery

### RTO vs RPO
- RTO (Recovery Time Objective): Maximum acceptable downtime
- RPO (Recovery Point Objective): Maximum acceptable data loss
    |------ data loss (RPO) ------|-- downtime (RTO) --|
    Last backup               Disaster        Recovery
### DR Strategies (Increasing Cost & Speed)
| Strategy | RTO | RPO | Cost |
|---|---|---|---|
| Backup & Restore | Hours | Hours | $ |
| Pilot Light | Tens of minutes | Seconds | $$ |
| Warm Standby | Minutes | Seconds | $$$ |
| Multi-Site Active-Active | Near-zero | Near-zero | $$$$ |
## Resources
- 📖 DDIA Chapter 5: Replication
- 📖 DDIA Chapter 8: The Trouble with Distributed Systems
- 📖 "Release It!" by Michael Nygard (circuit breakers, bulkheads)
- 🔗 Netflix Chaos Engineering
- 🔗 AWS Well-Architected: Reliability Pillar
- 🎥 Martin Kleppmann — Distributed Systems lectures
Previous: 11 - Distributed Consensus | Next: 13 - Distributed Transactions