Replication & Fault Tolerance

Why This Matters

Replication is how you achieve reliability, durability, and read scalability. Fault tolerance is what keeps systems running when things break — and things ALWAYS break at scale.


Replication Strategies

Single-Leader (Primary-Secondary)

All Writes → Primary → Replication Log → Secondary 1
                                        → Secondary 2
                                        → Secondary 3
Reads → Any node (potentially stale from secondaries)

Write-Ahead Log (WAL) Shipping:

  • Primary writes to WAL first, then applies
  • Ships WAL to secondaries
  • Used by: PostgreSQL, MySQL

Statement-Based Replication:

  • Primary sends SQL statements to secondaries
  • Problem: non-deterministic functions (NOW(), RAND())
  • Largely deprecated

Row-Based (Logical) Replication:

  • Ships actual data changes (insert row X, update column Y)
  • Deterministic, used by most modern systems
  • Enables Change Data Capture (CDC)

Multi-Leader

Client (US) → Leader US ↔ Leader EU ← Client (EU)
                ↕           ↕
            Follower US  Follower EU

Use cases:

  • Multi-datacenter operation (write locally, replicate globally)
  • Offline clients (each device is a "leader")
  • Collaborative editing

Conflict handling:

  • Avoid: route all writes for same data to same leader
  • Detect: version vectors, timestamps
  • Resolve: last-write-wins (LWW), merging values, or custom application logic
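
The last-write-wins strategy can be sketched in a few lines. Each replica's value carries a (timestamp, node_id) pair; the node id is a tie-breaker so every replica picks the same winner. The values and node names below are illustrative:

```python
# Last-write-wins (LWW) conflict resolution, minimal sketch.
# versions: list of (value, timestamp, node_id) tuples from different leaders.

def lww_resolve(versions):
    """Pick the winner: highest timestamp, node_id breaks ties."""
    return max(versions, key=lambda v: (v[1], v[2]))

# Two concurrent writes to the same key from different leaders:
winner = lww_resolve([
    ("draft-from-us", 1700000010, "leader-us"),
    ("draft-from-eu", 1700000012, "leader-eu"),
])
```

Note the trade-off LWW makes: the "losing" concurrent write is silently discarded, which is why merge or custom logic is preferred when data loss is unacceptable.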

Leaderless (Dynamo-Style)

Client → Write to W nodes
Client → Read from R nodes → Resolve conflicts

Quorum formula: W + R > N

  • N=3, W=2, R=2 → guaranteed overlap (strong reads)
  • N=3, W=1, R=1 → no overlap (fast but eventually consistent)
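
The overlap rule above can be checked mechanically; a minimal sketch:

```python
# Quorum overlap: with N replicas, a write acknowledged by W nodes and
# a read that queries R nodes must intersect in at least one replica
# whenever W + R > N, so the read sees the latest write.

def quorum_overlaps(n: int, w: int, r: int) -> bool:
    return w + r > n

def min_overlap(n: int, w: int, r: int) -> int:
    # Smallest possible intersection of a write quorum and a read quorum.
    return max(0, w + r - n)
```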

Anti-entropy mechanisms:

  • Read repair: On read, detect stale replica, update it
  • Merkle trees: Compare data fingerprints to find inconsistencies
  • Hinted handoff: If target node is down, another stores hint, delivers later
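
Read repair can be sketched as: query the replicas, take the highest-versioned value, and push it back to any replica that returned a stale copy. Here `replicas` is a hypothetical in-memory map of node → {key: (version, value)}:

```python
# Read repair, minimal sketch over toy in-memory replica stores.

def read_with_repair(replicas, key):
    responses = {node: store[key] for node, store in replicas.items()}
    latest = max(responses.values(), key=lambda vv: vv[0])  # highest version wins
    for node, (version, _) in responses.items():
        if version < latest[0]:
            replicas[node][key] = latest  # repair the stale replica in passing
    return latest[1]

replicas = {
    "n1": {"k": (2, "new")},
    "n2": {"k": (1, "old")},   # stale replica
    "n3": {"k": (2, "new")},
}
value = read_with_repair(replicas, "k")
```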

Failure Modes

Types of Failures

| Type | Description | Example |
| --- | --- | --- |
| Crash failure | Node stops completely | Server power failure |
| Omission failure | Node fails to send/receive | Network packet loss |
| Timing failure | Node responds too slowly | GC pause, disk I/O |
| Byzantine failure | Node behaves arbitrarily/maliciously | Bug, hacked node |

Failure Detection

Challenge: Can't distinguish between a crashed node and a slow node.

Heartbeat-based detection:

Node A → Heartbeat every 1s → Monitor
If no heartbeat for 5s → Suspect failure
If no heartbeat for 10s → Declare failed

Problems with timeout-based detection:

  • Too short → false positives (network blip marks healthy node as dead)
  • Too long → slow failure detection
  • Mitigation: the Phi Accrual Failure Detector (used by Cassandra) adapts its threshold from the historical distribution of heartbeat arrival intervals
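
The plain timeout scheme above, sketched with the 5s/10s thresholds (a fixed-threshold detector; adaptive detectors like phi accrual derive these from observed heartbeat jitter). Timestamps are injectable so the logic is testable without sleeping:

```python
import time

# Fixed-threshold heartbeat failure detection: no heartbeat for 5s
# marks a node SUSPECT, for 10s marks it FAILED. Assumes at least one
# heartbeat has been recorded for a node before its status is queried.

class HeartbeatMonitor:
    SUSPECT_AFTER = 5.0
    FAILED_AFTER = 10.0

    def __init__(self):
        self.last_seen = {}  # node -> timestamp of last heartbeat

    def heartbeat(self, node, now=None):
        self.last_seen[node] = now if now is not None else time.monotonic()

    def status(self, node, now=None):
        now = now if now is not None else time.monotonic()
        silence = now - self.last_seen[node]
        if silence >= self.FAILED_AFTER:
            return "FAILED"
        if silence >= self.SUSPECT_AFTER:
            return "SUSPECT"
        return "ALIVE"

m = HeartbeatMonitor()
m.heartbeat("node-a", now=0.0)
```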

Fault Tolerance Patterns

Redundancy Levels

| Level | Description | Example |
| --- | --- | --- |
| Active-Passive | Standby takes over on failure | Database failover |
| Active-Active | All nodes serve traffic | Multi-leader replication |
| N+1 | One spare for N active nodes | 3 servers + 1 standby |
| N+2 | Two spares (survives 2 failures) | Critical financial systems |

Circuit Breaker

States:
  CLOSED → normal operation, track failure count
    → failure threshold exceeded →
  OPEN → all requests fail fast (don't hit failing service)
    → timeout expires →
  HALF-OPEN → allow one test request
    → success → CLOSED
    → failure → OPEN

Implementation: Track failure rate over sliding window. Libraries: Hystrix (Netflix), Resilience4j, Polly.
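
A minimal in-process sketch of the state machine above. The threshold and timeout values are illustrative, and a simple failure counter stands in for the sliding-window failure rate that production libraries use:

```python
import time

# Circuit breaker: CLOSED counts failures, OPEN fails fast until a
# recovery timeout, HALF-OPEN lets one probe request through.

class CircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, fn, now=None):
        now = now if now is not None else time.monotonic()
        if self.state == "OPEN":
            if now - self.opened_at < self.recovery_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.state = "HALF-OPEN"  # timeout expired: allow one test request
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "HALF-OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = now
            raise
        self.state = "CLOSED"  # any success closes the breaker
        self.failures = 0
        return result
```

Passing `now` explicitly keeps the sketch deterministic; real code would just use the clock.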

Bulkhead

Isolate failures to prevent cascade:

Thread Pool A: User Service (10 threads)
Thread Pool B: Payment Service (10 threads)
Thread Pool C: Search Service (10 threads)

If Payment Service hangs → only its 10 threads blocked
User Service and Search still work!
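
The same idea in Python, using one bounded thread pool per downstream dependency. Service names and pool sizes are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

# Bulkhead: each dependency gets its own bounded pool, so a hang in
# one service can only exhaust that service's threads.

pools = {
    "user":    ThreadPoolExecutor(max_workers=10),
    "payment": ThreadPoolExecutor(max_workers=10),
    "search":  ThreadPoolExecutor(max_workers=10),
}

def call_service(name, fn, *args):
    # Blocking calls routed to `name` can only tie up that pool.
    return pools[name].submit(fn, *args)

future = call_service("search", lambda q: f"results for {q}", "replication")
```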

Retry with Exponential Backoff

Attempt 1: immediate
Attempt 2: wait 1s + jitter
Attempt 3: wait 2s + jitter
Attempt 4: wait 4s + jitter
Attempt 5: wait 8s + jitter (e.g. random 0-1s, added to every wait)

Jitter is critical: Without it, all clients retry at the same time → thundering herd.
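
The schedule above can be sketched as follows, with jitter added to every delay and the sleep function injectable so the logic is testable without waiting:

```python
import random
import time

# Retry with exponential backoff + jitter. After the k-th failed
# attempt, wait base * 2^(k-1) seconds plus random 0-1s of jitter.

def retry_with_backoff(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1) + random.random())
```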

Timeout

  • Always set timeouts on external calls
  • Connection timeout: how long to establish connection (short: 1-5s)
  • Read timeout: how long to wait for response (depends on operation)
  • No timeout = thread/connection leak risk

Idempotency

Make operations safe to retry:

POST /payments
Idempotency-Key: "abc-123"

First call: processes payment, returns 200
Retry (same key): returns same 200 (no double charge)

Implementation: Store idempotency key + result. On retry, return stored result.
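
A toy in-memory version of that store-and-replay scheme (the service class and field names are made up for illustration):

```python
# Idempotency-key handling: cache the response under the client-supplied
# key; a retry with the same key replays the response without re-running
# the side effect (here, incrementing a charge counter).

class PaymentService:
    def __init__(self):
        self.results = {}   # idempotency key -> stored response
        self.charges = 0    # stands in for the real side effect

    def pay(self, idempotency_key, amount):
        if idempotency_key in self.results:
            return self.results[idempotency_key]  # replay, no double charge
        self.charges += 1
        response = {"status": 200, "charged": amount}
        self.results[idempotency_key] = response
        return response

svc = PaymentService()
first = svc.pay("abc-123", 50)
retry = svc.pay("abc-123", 50)   # same key: replayed, not re-charged
```

In a real system the key-to-result map lives in durable storage (and expires), so replays survive a server restart.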


Failover Strategies

Planned Failover

  • Graceful switchover for maintenance
  • Drain connections from old primary → promote new primary

Unplanned Failover

1. Detect failure (heartbeat timeout)
2. Choose new primary (most up-to-date replica)
3. Reconfigure clients to point to new primary
4. Old primary joins as replica when it recovers
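
Step 2 can be sketched as picking the replica with the highest replication position (the field is illustrative; e.g. a WAL byte offset in PostgreSQL):

```python
# Choose the most up-to-date replica as the new primary by comparing
# last applied replication log positions.

def choose_new_primary(replicas):
    """replicas: dict of node name -> last applied log position."""
    return max(replicas, key=replicas.get)

new_primary = choose_new_primary({"r1": 1042, "r2": 1057, "r3": 998})
```

With asynchronous replication even the most up-to-date replica may lag the failed primary, which is the data-loss risk listed below.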

Failover Challenges

  • Data loss: Async replication means some writes may not have reached replica
  • Split brain: Old primary comes back, thinks it's still leader
  • Cascading failures: Failover increases load on remaining nodes → they fail too
  • Client redirection: Clients need to discover new primary (DNS update, service discovery)

Chaos Engineering

Philosophy

"If something can break, it will. Better to break it deliberately and learn."

Netflix Chaos Monkey

  • Randomly kills production instances
  • Forces teams to build resilient systems
  • Part of Netflix's "Simian Army"

Chaos Engineering Principles

  1. Define steady state (normal behavior metrics)
  2. Hypothesize: "system will continue working if X fails"
  3. Inject failure (kill instance, introduce latency, corrupt network)
  4. Observe: did system maintain steady state?
  5. Fix weaknesses found

Common Chaos Experiments

| Experiment | Tests |
| --- | --- |
| Kill random instance | Auto-recovery, health checks |
| Add network latency | Timeout handling, circuit breakers |
| Fill disk | Storage monitoring, cleanup |
| CPU stress | Auto-scaling, load balancing |
| Kill entire AZ | Multi-AZ resilience |
| DNS failure | Fallback, caching |

Disaster Recovery

RTO vs RPO

  • RTO (Recovery Time Objective): Maximum acceptable downtime
  • RPO (Recovery Point Objective): Maximum acceptable data loss

|------ data loss (RPO) ------|-- downtime (RTO) --|
Last backup                 Disaster          Recovery

DR Strategies (Increasing Cost & Speed)

| Strategy | RTO | RPO | Cost |
| --- | --- | --- | --- |
| Backup & Restore | Hours | Hours | $ |
| Pilot Light | Minutes | Seconds | $$ |
| Warm Standby | Minutes | Seconds | $$$ |
| Multi-Site Active-Active | Near-zero | Near-zero | $$$$ |

Previous: 11 - Distributed Consensus | Next: 13 - Distributed Transactions