45 - Geo-Distributed Systems

Previous: 44 - Gossip Protocols & Membership | Next: 46 - ML System Design Basics


1. Why Go Multi-Region?

Driver            | Explanation
----------------- | -----------
Latency           | Users in Tokyo shouldn't wait 200ms for a server in Virginia. Serve from the nearest region.
Availability      | A single region outage (AWS us-east-1 goes down) shouldn't take the whole system offline.
Compliance        | GDPR requires EU user data to stay in the EU. Data sovereignty laws vary by country.
Disaster recovery | Natural disasters, power outages, and fiber cuts can take out an entire region.
Scalability       | Distribute load across regions rather than vertically scaling a single region.

2. Active-Passive vs Active-Active

Active-Passive (Primary-Secondary)

                         Users
                      /    |    \
              US users  EU users  APAC users
                  |        |         |
                  v        v         v
            +----------+  Global LB routes all writes
            |          |  to primary region
            v          v
    +--------------+  +--------------+
    | US-EAST      |  | EU-WEST      |
    | (PRIMARY)    |  | (STANDBY)    |
    | Read + Write |  | Read only    |
    +---------+----+  +------+-------+
              |              ^
              | async        |
              | replication  |
              +--------------+
Aspect         | Detail
-------------- | ------
Writes         | All go to primary region
Reads          | Can serve from any region
Failover       | Promote standby to primary (minutes of downtime)
Complexity     | Lower
Data loss risk | RPO > 0 (async replication lag)

Active-Active (Multi-Primary)

                         Users
                      /    |    \
              US users  EU users  APAC users
                  |        |         |
                  v        v         v
          +-----------+ +-----------+ +-----------+
          | US-EAST   | | EU-WEST   | | APAC-EAST |
          | Read+Write| | Read+Write| | Read+Write|
          +-----+-----+ +-----+-----+ +-----+-----+
                |              |              |
                +--------------+--------------+
                    bi-directional async
                    replication
Aspect      | Detail
----------- | ------
Writes      | Any region can accept writes
Reads       | Served locally (lowest latency)
Failover    | Other regions absorb traffic (near-zero downtime)
Complexity  | High (conflict resolution required)
Consistency | Eventual (conflicts possible)

Comparison

Criterion           | Active-Passive                | Active-Active
------------------- | ----------------------------- | -------------
Write latency       | High for remote users         | Low (local writes)
Failover time       | Minutes                       | Seconds
Conflict resolution | None needed                   | Required (CRDTs, LWW, app-level)
Cost                | Lower (standby underutilized) | Higher (all regions active)
Data consistency    | Easier (single writer)        | Harder (multi-writer)

Interview Tip

Most companies start active-passive and evolve to active-active only for specific high-traffic, latency-sensitive workloads. Full active-active is expensive and complex. Be ready to justify which approach fits the use case.


3. Data Replication Across Regions

Synchronous vs Asynchronous

Synchronous:
  US-EAST writes data
    |--- replicate to EU-WEST (wait for ACK) --- 80-150ms cross-region
    |--- replicate to APAC (wait for ACK) ------- 150-300ms cross-region
    |--- respond to client
  Total write latency: 150-300ms+ (bottleneck: farthest region)

Asynchronous:
  US-EAST writes data
    |--- respond to client immediately
    |--- replicate to EU-WEST in background
    |--- replicate to APAC in background
  Total write latency: ~5ms (local only)
  Trade-off: data may be lost if region fails before replication
Approach         | Write Latency           | Consistency       | Data Loss Risk
---------------- | ----------------------- | ----------------- | --------------
Synchronous      | High (cross-region RTT) | Strong            | None
Asynchronous     | Low (local only)        | Eventual          | Window of replication lag
Semi-synchronous | Medium (1 remote ACK)   | Bounded staleness | Reduced
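The three modes above can be sketched as a simple latency model. This is an illustrative Python sketch, not a real replication client: it assumes a ~5ms local commit and the cross-region RTTs from the reference table below (both numbers are assumptions).

```python
# Client-observed write latency for a write landing in us-east, under each
# replication mode. Constants are illustrative assumptions.
LOCAL_COMMIT_MS = 5
RTT_MS = {"eu-west": 100, "apac-east": 200}  # RTT from us-east, in ms

def write_latency(mode: str) -> int:
    """Write latency in ms as seen by the client."""
    if mode == "sync":        # wait for every remote ACK: bounded by farthest region
        return LOCAL_COMMIT_MS + max(RTT_MS.values())
    if mode == "semi-sync":   # wait for the single nearest remote ACK
        return LOCAL_COMMIT_MS + min(RTT_MS.values())
    if mode == "async":       # respond after local commit; replicate in background
        return LOCAL_COMMIT_MS
    raise ValueError(f"unknown mode: {mode}")
```

The model makes the bottleneck visible: synchronous writes are gated by the farthest region's RTT, which is why most systems choose async or semi-sync.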

Cross-Region Latency Reference

Route                  | Typical RTT
---------------------- | -----------
US-East <-> US-West    | 60-80ms
US-East <-> EU-West    | 80-120ms
US-East <-> APAC       | 150-250ms
EU-West <-> APAC       | 200-300ms
Same region (AZ to AZ) | 1-2ms

4. Conflict Resolution for Multi-Region Writes

When two regions write to the same record concurrently:

US-EAST:  UPDATE user SET name = "Alice"  (timestamp: t1)
EU-WEST:  UPDATE user SET name = "Bob"    (timestamp: t2)

Both succeed locally. On replication, conflict detected.
Strategy                | How                                         | Trade-off
----------------------- | ------------------------------------------- | ---------
Last-Writer-Wins (LWW)  | Higher timestamp wins                       | Simple, but the losing write is silently discarded (data loss)
CRDTs                   | Merge-friendly data structures              | No data loss; limited data type support (see 43 - CRDT & Conflict-Free Replication)
Application-level       | App defines custom merge logic              | Most flexible, most complex
Conflict-free by design | Partition data so each region owns a subset | No conflicts possible; limits flexibility
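The Alice/Bob conflict above resolves under LWW like this. A minimal illustrative sketch: each version carries a (timestamp, region) pair, and the region name breaks timestamp ties so every replica converges to the same winner (field names are hypothetical).

```python
# Last-Writer-Wins merge: pick the version with the higher (timestamp, region)
# pair. The region tiebreak ensures all replicas agree deterministically.
def lww_merge(a: dict, b: dict) -> dict:
    """Return the winning version; the loser is silently discarded."""
    return a if (a["ts"], a["region"]) >= (b["ts"], b["region"]) else b

us = {"name": "Alice", "ts": 1001, "region": "us-east"}
eu = {"name": "Bob",   "ts": 1002, "region": "eu-west"}
winner = lww_merge(us, eu)  # EU write wins on the higher timestamp
```

Note what the table warned about: Alice's update is gone with no error anywhere, which is exactly the silent data loss LWW trades for simplicity.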

Conflict-Free by Design (Recommended Starting Point)

Strategy: Route writes to the "home region" of the data

User created in EU -> EU-WEST is the home region for this user's data
All writes to this user go to EU-WEST
Other regions serve reads from local replica

Routing table:
  user_id % N -> home_region
  OR: user's registration country -> home_region

This avoids multi-writer conflicts entirely.
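The routing table above might look like the following minimal sketch. It assumes a static country-to-region map with a deterministic hash fallback; the region names and mapping are illustrative.

```python
# Home-region routing: every write for a user goes to one fixed region,
# so multi-writer conflicts cannot occur.
REGIONS = ["us-east", "eu-west", "apac-east"]
COUNTRY_TO_REGION = {"US": "us-east", "DE": "eu-west", "JP": "apac-east"}

def home_region(user_id: int, country: str = "") -> str:
    """Region that owns this user's writes."""
    if country in COUNTRY_TO_REGION:        # prefer registration country
        return COUNTRY_TO_REGION[country]
    return REGIONS[user_id % len(REGIONS)]  # deterministic fallback: id % N
```

Because the function is deterministic, every service in every region computes the same owner for a given user without coordination.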

5. Global Load Balancing

                         DNS Query: api.example.com
                                |
                         +------v------+
                         | GeoDNS      |
                         | (Route 53)  |
                         +------+------+
                                |
              +-----------------+-----------------+
              |                 |                 |
      US user gets:     EU user gets:    APAC user gets:
      52.1.2.3          34.5.6.7         13.8.9.10
      (US-EAST LB)      (EU-WEST LB)    (APAC LB)
              |                 |                 |
      +-------v-------+ +------v------+ +--------v------+
      | Regional LB   | | Regional LB | | Regional LB   |
      | (US-EAST)     | | (EU-WEST)   | | (APAC-EAST)   |
      +---------------+ +-------------+ +---------------+

Routing Techniques

Technique           | How                                                                            | Latency         | Failover
------------------- | ------------------------------------------------------------------------------ | --------------- | --------
GeoDNS              | DNS returns IP of nearest region based on client IP                            | Good            | Slow (DNS TTL, 30-300s)
Anycast             | Same IP announced from all regions; BGP routes to nearest                      | Best            | Fast (BGP convergence, seconds)
Global LB (L7)      | Single entry point, intelligent routing (e.g., Cloudflare, AWS Global Accelerator) | Good        | Fast (health-check based)
Client-side routing | Client SDK picks region based on latency probes                                | Best (measured) | Immediate
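The client-side routing row can be sketched as a region picker. This is a toy, not a real SDK: the probe results are assumed inputs (a real client would probe each endpoint periodically and mark failures as None).

```python
# Client-side routing: given probe results per region (RTT in ms, or None
# if the probe failed), pick the healthy region with the lowest latency.
def pick_region(probe_ms: dict) -> str:
    """Return the healthy region with the lowest measured RTT."""
    healthy = {r: ms for r, ms in probe_ms.items() if ms is not None}
    if not healthy:
        raise RuntimeError("no healthy region reachable")
    return min(healthy, key=healthy.get)

# us-east's probe failed, so the client immediately falls back to eu-west:
choice = pick_region({"us-east": None, "eu-west": 95.0, "apac-east": 210.0})
```

This is why the table lists client-side failover as "Immediate": the switch happens on the very next request, with no DNS TTL or BGP convergence in the path.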

6. Data Placement & Sovereignty

GDPR and Data Sovereignty

GDPR Rule: EU personal data must be processed in compliance with EU law.
           In practice: store EU user data in EU regions.

Implementation:

  User registers from Germany
    -> user.home_region = EU-WEST
    -> PII stored only in EU-WEST
    -> Aggregated/anonymized data may be replicated globally

  Data classification:
    PII (name, email, address)        -> restricted to home region
    Non-PII (product views, clicks)   -> can replicate globally
    Aggregated metrics                 -> can replicate globally
Data Type             | Can Replicate Globally?                  | Notes
--------------------- | ---------------------------------------- | -----
PII (personal data)   | No (unless compliant transfer mechanism) | GDPR, CCPA, LGPD
Financial data        | Jurisdiction-dependent                   | PCI DSS, local finance laws
Healthcare data       | No                                       | HIPAA (US), local equivalents
Anonymized/aggregated | Yes                                      | No longer personal data
Application config    | Yes                                      | Non-sensitive
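The classification above can be enforced as a replication policy check. A hypothetical sketch under the simplest possible rule: globally replicable classes go everywhere, everything else stays in the home region (class names and the function are made up for illustration).

```python
# Replication policy derived from the data classification table:
# only non-personal data classes may leave the home region.
GLOBALLY_REPLICABLE = {"non_pii", "aggregated", "config"}

def allowed_regions(data_class: str, home: str, all_regions: list) -> list:
    """Regions where a record of this class may be stored."""
    if data_class in GLOBALLY_REPLICABLE:
        return list(all_regions)
    return [home]  # PII, financial, healthcare: restricted to home region
```

In practice the replication pipeline would consult this policy per table or per column, so a single user row can split into a home-region-only PII part and a globally replicated behavioral part.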

7. Designing for Regional Failure

Region US-EAST goes down:

  Before failure:
    US users -> US-EAST
    EU users -> EU-WEST
    APAC users -> APAC-EAST

  After failure:
    US users -> EU-WEST (failover, +80ms latency)
    EU users -> EU-WEST (unchanged)
    APAC users -> APAC-EAST (unchanged)

  Requirements for successful failover:
    1. Health checks detect US-EAST failure (seconds)
    2. DNS/LB routes US traffic to EU-WEST (seconds to minutes)
    3. EU-WEST has capacity to handle US + EU traffic
    4. EU-WEST has sufficiently recent data replica
    5. Write path redirected or queued

Capacity Planning for Failover

Rule of thumb: Each region should have 50% headroom

  Normal load per region:  1000 QPS
  Capacity per region:     1500 QPS  (50% headroom)
  
  When one of 3 regions fails:
    Surviving regions each absorb ~500 extra QPS
    New load: 1500 QPS each (at capacity, but functional)

Alternative: N+1 regions
  3 regions needed for normal traffic
  Deploy 4 regions (1 extra for failover capacity)
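The headroom arithmetic above generalizes: with N regions at load L each and one region lost, each survivor absorbs L/(N-1) extra QPS. A quick worked check:

```python
# Per-region load after one of n_regions fails, assuming traffic from the
# failed region spreads evenly across the survivors.
def post_failover_load(normal_qps: float, n_regions: int) -> float:
    """QPS each surviving region must handle."""
    return normal_qps + normal_qps / (n_regions - 1)

# 3 regions at 1000 QPS each, provisioned for 1500 QPS (50% headroom):
load = post_failover_load(1000, 3)  # 1000 + 1000/2 = 1500.0: at capacity
```

Note the rule of thumb is specific to 3 regions: with only 2 regions the survivor takes 100% extra load, so it would need 100% headroom, while with 4 regions 34% headroom suffices.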

8. Traffic Management During Failover

Failover Timeline:

  t=0:    US-EAST health checks fail
  t=5s:   Health check confirms failure (3 consecutive failures)
  t=10s:  Global LB removes US-EAST from rotation
  t=10s:  US traffic routes to EU-WEST (nearest healthy)
  t=10s+: EU-WEST auto-scales to handle increased load

  DNS-based failover:
    Slower: DNS TTL must expire (30-300s)
    Mitigation: Use low TTL (30s) for critical services
    Or: Use anycast/global LB instead of DNS

  Client-side failover:
    Fastest: Client detects failure, switches endpoint
    Requires: Client SDK with failover logic

9. Consistency Trade-offs in Multi-Region

The spectrum:

  Strong Consistency                              Eventual Consistency
  |                                                              |
  Spanner (sync cross-region)    Bounded staleness    DynamoDB Global Tables
  Slow writes, no conflicts      Tunable              Fast writes, conflicts possible
Model                | Write Latency | Read Freshness                   | Example
-------------------- | ------------- | -------------------------------- | -------
Strong (synchronous) | 100-300ms     | Always fresh                     | Google Spanner
Bounded staleness    | 50-100ms      | Stale by at most X seconds       | CockroachDB (follower reads)
Causal consistency   | 10-50ms       | Causally related reads are fresh | MongoDB sessions
Eventual             | 1-10ms        | May read stale data              | DynamoDB Global Tables

Interview Tip

Google Spanner achieves strong consistency across regions using TrueTime (GPS + atomic clocks for globally synchronized timestamps). This is hardware-dependent and unique to Google's infrastructure. CockroachDB approximates this with software-based uncertainty intervals. Most systems use eventual consistency because cross-region synchronous writes are too slow.


10. Cost Considerations

Cost Factor   | Impact
------------- | ------
Data transfer | Cross-region bandwidth: $0.02-0.09/GB (AWS). Replication generates significant transfer costs.
Compute       | Running full application stack in 3+ regions: 3x compute cost.
Storage       | Full replicas in each region: 3x storage cost.
Database      | Multi-region databases (Spanner, CockroachDB) command premium pricing.
Networking    | Global load balancers, dedicated interconnects.

Cost Optimization Strategies

1. Replicate only what's needed
   - Full DB replica: expensive
   - Cache popular data in remote regions: cheaper
   - CDN for static content: cheapest

2. Tiered approach
   - Primary region: full stack
   - Secondary regions: read replicas + cache + CDN
   - Failover region: minimal standby, scale on demand

3. Reserved capacity in primary, spot/on-demand in secondary

4. Compress replication traffic (50-80% reduction)
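Strategies 1 and 4 can be combined into a back-of-envelope cost estimate. All inputs here are illustrative assumptions (a mid-range $0.05/GB rate and 70% compression, i.e. compressed size is 0.3x the raw size):

```python
# Rough monthly cost of cross-region replication traffic.
def monthly_transfer_cost(gb_per_day: float, n_replica_regions: int,
                          usd_per_gb: float = 0.05,
                          compression_ratio: float = 0.3) -> float:
    """compression_ratio = compressed size / raw size."""
    gb_per_month = gb_per_day * 30 * n_replica_regions * compression_ratio
    return gb_per_month * usd_per_gb

# 100 GB/day of changes replicated to 2 remote regions, compressed:
cost = monthly_transfer_cost(100, 2)  # roughly $90/month
```

Running the same numbers without compression (ratio 1.0) gives about $300/month, which is the 50-80% reduction claim made concrete.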

11. Real-World Examples

Google Spanner

- Globally distributed, strongly consistent
- Uses TrueTime (GPS + atomic clocks) for global ordering
- Synchronous replication across regions
- Write latency: 10-100ms (within continent), 100-300ms (cross-continent)
- Used for: Google Ads, Google Play
- Externally consistent: strongest possible guarantee

CockroachDB

- Open-source Spanner-inspired database
- Software-based uncertainty intervals (no special hardware)
- Serializable isolation across regions
- Survival goals: region-level failure tolerance
- Follower reads: bounded-staleness reads from any region

DynamoDB Global Tables

- Active-active multi-region
- Eventual consistency for cross-region reads
- LWW conflict resolution (last writer wins)
- Replication lag: typically < 1 second
- Simple setup: enable Global Tables, select regions
- Trade-off: no strong consistency across regions

Netflix

- Active-active across 3 AWS regions (us-east, us-west, eu-west)
- Each region serves its local traffic
- Regional failure: traffic drains to other regions
- Zuul (API gateway) handles routing
- EVCache (Memcached-based) replicated across regions
- Cassandra for multi-region data with eventual consistency

12. Architecture Pattern: Multi-Region with Regional Data Ownership

                     +-------------------+
                     |  Global DNS /     |
                     |  GeoDNS (Route53) |
                     +--------+----------+
                              |
            +-----------------+-----------------+
            |                 |                 |
    +-------v-------+ +------v------+ +--------v------+
    | US-EAST       | | EU-WEST     | | APAC-EAST     |
    +---------------+ +-------------+ +---------------+
    | API Gateway   | | API Gateway | | API Gateway   |
    | App Servers   | | App Servers | | App Servers   |
    | Cache (Redis) | | Cache       | | Cache         |
    | DB Primary    | | DB Primary  | | DB Primary    |
    |  (US users)   | |  (EU users) | |  (APAC users) |
    | DB Replica    | | DB Replica  | | DB Replica    |
    |  (EU, APAC)   | |  (US, APAC) | |  (US, EU)     |
    +-------+-------+ +------+------+ +-------+------+
            |                |                 |
            +----------------+-----------------+
                   Cross-region async replication
                   (replicas are read-only)

    Write path: user's home region (strong consistency)
    Read path:  local region replica (eventual consistency)
    Failover:   promote replica to primary in surviving region

13. Key Trade-offs Discussion

Decision            | Option A                       | Option B
------------------- | ------------------------------ | --------
Topology            | Active-passive (simpler)       | Active-active (lower latency, costlier)
Replication         | Sync (strong, slow)            | Async (fast, risks data loss)
Conflict resolution | LWW (simple, data loss)        | CRDTs or app-level (complex, no data loss)
Data placement      | Full replication (simple)      | Partitioned by region (compliant, complex routing)
Failover            | DNS-based (slow, simple)       | Global LB / anycast (fast, expensive)
Consistency         | Strong everywhere (slow writes)| Eventual (fast, conflicts)

14. Interview Checklist

  • Explained motivations: latency, availability, compliance
  • Compared active-passive vs active-active with clear trade-offs
  • Discussed replication: sync vs async with cross-region latency numbers
  • Covered conflict resolution strategies (LWW, CRDT, conflict-free-by-design)
  • Designed global load balancing (GeoDNS, anycast, global LB)
  • Addressed data sovereignty (GDPR, data classification)
  • Planned for regional failure (capacity headroom, failover timeline)
  • Discussed consistency spectrum (Spanner to DynamoDB)
  • Mentioned cost considerations and optimization strategies
  • Referenced real-world systems (Spanner, CockroachDB, DynamoDB Global Tables)

15. Resources

  • Book: "Designing Data-Intensive Applications" (Kleppmann) -- Chapters 5, 9
  • Google Spanner Paper -- "Spanner: Google's Globally Distributed Database" (OSDI 2012)
  • CockroachDB Architecture Docs -- cockroachlabs.com/docs/architecture
  • AWS Multi-Region Architectures -- AWS Well-Architected whitepaper
  • Netflix Tech Blog -- "Active-Active for Multi-Regional Resiliency"
  • YouTube: Martin Kleppmann -- "Distributed Systems" lecture (Cambridge)
  • Paper: "TAO: Facebook's Distributed Data Store for the Social Graph" (ATC 2013)
