45 - Geo-Distributed Systems

Previous: 44 - Gossip Protocols & Membership | Next: 46 - ML System Design Basics


1. Why Go Multi-Region?

Driver            | Explanation
----------------- | -----------
Latency           | Users in Tokyo shouldn't wait 200ms for a server in Virginia. Serve from the nearest region.
Availability      | A single region outage (AWS us-east-1 goes down) shouldn't take the whole system offline.
Compliance        | GDPR requires EU user data to stay in the EU. Data sovereignty laws vary by country.
Disaster recovery | Natural disasters, power outages, and fiber cuts can take out an entire region.
Scalability       | Distribute load across regions rather than vertically scaling a single region.

2. Active-Passive vs Active-Active

Active-Passive (Primary-Secondary)

                         Users
                      /    |    \
              US users  EU users  APAC users
                  |        |         |
                  v        v         v
            +----------+  Global LB routes all writes
            |          |  to primary region
            v          v
    +--------------+  +--------------+
    | US-EAST      |  | EU-WEST      |
    | (PRIMARY)    |  | (STANDBY)    |
    | Read + Write |  | Read only    |
    +---------+----+  +------+-------+
              |              ^
              | async        |
              | replication  |
              +--------------+
Aspect         | Detail
-------------- | ------
Writes         | All go to primary region
Reads          | Can serve from any region
Failover       | Promote standby to primary (minutes of downtime)
Complexity     | Lower
Data loss risk | RPO > 0 (async replication lag)

Active-Active (Multi-Primary)

                         Users
                      /    |    \
              US users  EU users  APAC users
                  |        |         |
                  v        v         v
          +-----------+ +-----------+ +-----------+
          | US-EAST   | | EU-WEST   | | APAC-EAST |
          | Read+Write| | Read+Write| | Read+Write|
          +-----+-----+ +-----+-----+ +-----+-----+
                |              |              |
                +--------------+--------------+
                    bi-directional async
                    replication
Aspect      | Detail
----------- | ------
Writes      | Any region can accept writes
Reads       | Served locally (lowest latency)
Failover    | Other regions absorb traffic (near-zero downtime)
Complexity  | High (conflict resolution required)
Consistency | Eventual (conflicts possible)

Comparison

Criterion           | Active-Passive                | Active-Active
------------------- | ----------------------------- | -------------
Write latency       | High for remote users         | Low (local writes)
Failover time       | Minutes                       | Seconds
Conflict resolution | None needed                   | Required (CRDTs, LWW, app-level)
Cost                | Lower (standby underutilized) | Higher (all regions active)
Data consistency    | Easier (single writer)        | Harder (multi-writer)

Interview Tip

Most companies start active-passive and evolve to active-active only for specific high-traffic, latency-sensitive workloads. Full active-active is expensive and complex. Be ready to justify which approach fits the use case.


3. Data Replication Across Regions

Synchronous vs Asynchronous

Synchronous:
  US-EAST writes data
    |--- replicate to EU-WEST (wait for ACK) --- 80-150ms cross-region
    |--- replicate to APAC (wait for ACK) ------- 150-300ms cross-region
    |--- respond to client
  Total write latency: 150-300ms+ (bottleneck: farthest region)

Asynchronous:
  US-EAST writes data
    |--- respond to client immediately
    |--- replicate to EU-WEST in background
    |--- replicate to APAC in background
  Total write latency: ~5ms (local only)
  Trade-off: data may be lost if region fails before replication
Approach         | Write Latency           | Consistency       | Data Loss Risk
---------------- | ----------------------- | ----------------- | --------------
Synchronous      | High (cross-region RTT) | Strong            | None
Asynchronous     | Low (local only)        | Eventual          | Window of replication lag
Semi-synchronous | Medium (1 remote ACK)   | Bounded staleness | Reduced
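The three modes above can be sketched as a simple latency model. This is an illustrative Python sketch, not a real replication client: it assumes a ~5ms local commit and the cross-region RTTs from the reference table below (both numbers are assumptions).

```python
# Client-observed write latency for a write landing in us-east, under each
# replication mode. Constants are illustrative assumptions.
LOCAL_COMMIT_MS = 5
RTT_MS = {"eu-west": 100, "apac-east": 200}  # RTT from us-east, in ms

def write_latency(mode: str) -> int:
    """Write latency in ms as seen by the client."""
    if mode == "sync":        # wait for every remote ACK: bounded by farthest region
        return LOCAL_COMMIT_MS + max(RTT_MS.values())
    if mode == "semi-sync":   # wait for the single nearest remote ACK
        return LOCAL_COMMIT_MS + min(RTT_MS.values())
    if mode == "async":       # respond after local commit; replicate in background
        return LOCAL_COMMIT_MS
    raise ValueError(f"unknown mode: {mode}")
```

The model makes the bottleneck visible: synchronous writes are gated by the farthest region's RTT, which is why most systems choose async or semi-sync.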

Cross-Region Latency Reference

Route                  | Typical RTT
---------------------- | -----------
US-East <-> US-West    | 60-80ms
US-East <-> EU-West    | 80-120ms
US-East <-> APAC       | 150-250ms
EU-West <-> APAC       | 200-300ms
Same region (AZ to AZ) | 1-2ms

4. Conflict Resolution for Multi-Region Writes

When two regions write to the same record concurrently:

US-EAST:  UPDATE user SET name = "Alice"  (timestamp: t1)
EU-WEST:  UPDATE user SET name = "Bob"    (timestamp: t2)

Both succeed locally. On replication, conflict detected.
Strategy                | How                                         | Trade-off
----------------------- | ------------------------------------------- | ---------
Last-Writer-Wins (LWW)  | Higher timestamp wins                       | Simple, but the losing write is silently discarded (data loss)
CRDTs                   | Merge-friendly data structures              | No data loss; limited data type support (see 43 - CRDT & Conflict-Free Replication)
Application-level       | App defines custom merge logic              | Most flexible, most complex
Conflict-free by design | Partition data so each region owns a subset | No conflicts possible; limits flexibility
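The Alice/Bob conflict above resolves under LWW like this. A minimal illustrative sketch: each version carries a (timestamp, region) pair, and the region name breaks timestamp ties so every replica converges to the same winner (field names are hypothetical).

```python
# Last-Writer-Wins merge: pick the version with the higher (timestamp, region)
# pair. The region tiebreak ensures all replicas agree deterministically.
def lww_merge(a: dict, b: dict) -> dict:
    """Return the winning version; the loser is silently discarded."""
    return a if (a["ts"], a["region"]) >= (b["ts"], b["region"]) else b

us = {"name": "Alice", "ts": 1001, "region": "us-east"}
eu = {"name": "Bob",   "ts": 1002, "region": "eu-west"}
winner = lww_merge(us, eu)  # EU write wins on the higher timestamp
```

Note what the table warned about: Alice's update is gone with no error anywhere, which is exactly the silent data loss LWW trades for simplicity.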

Conflict-Free by Design (Recommended Starting Point)

Strategy: Route writes to the "home region" of the data

User created in EU -> EU-WEST is the home region for this user's data
All writes to this user go to EU-WEST
Other regions serve reads from local replica

Routing table:
  user_id % N -> home_region
  OR: user's registration country -> home_region

This avoids multi-writer conflicts entirely.
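The routing table above might look like the following minimal sketch. It assumes a static country-to-region map with a deterministic hash fallback; the region names and mapping are illustrative.

```python
# Home-region routing: every write for a user goes to one fixed region,
# so multi-writer conflicts cannot occur.
REGIONS = ["us-east", "eu-west", "apac-east"]
COUNTRY_TO_REGION = {"US": "us-east", "DE": "eu-west", "JP": "apac-east"}

def home_region(user_id: int, country: str = "") -> str:
    """Region that owns this user's writes."""
    if country in COUNTRY_TO_REGION:        # prefer registration country
        return COUNTRY_TO_REGION[country]
    return REGIONS[user_id % len(REGIONS)]  # deterministic fallback: id % N
```

Because the function is deterministic, every service in every region computes the same owner for a given user without coordination.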

5. Global Load Balancing

                         DNS Query: api.example.com
                                |
                         +------v------+
                         | GeoDNS      |
                         | (Route 53)  |
                         +------+------+
                                |
              +-----------------+-----------------+
              |                 |                 |
      US user gets:     EU user gets:    APAC user gets:
      52.1.2.3          34.5.6.7         13.8.9.10
      (US-EAST LB)      (EU-WEST LB)    (APAC LB)
              |                 |                 |
      +-------v-------+ +------v------+ +--------v------+
      | Regional LB   | | Regional LB | | Regional LB   |
      | (US-EAST)     | | (EU-WEST)   | | (APAC-EAST)   |
      +---------------+ +-------------+ +---------------+

Routing Techniques

Technique           | How                                                                            | Latency         | Failover
------------------- | ------------------------------------------------------------------------------ | --------------- | --------
GeoDNS              | DNS returns IP of nearest region based on client IP                            | Good            | Slow (DNS TTL, 30-300s)
Anycast             | Same IP announced from all regions; BGP routes to nearest                      | Best            | Fast (BGP convergence, seconds)
Global LB (L7)      | Single entry point, intelligent routing (e.g., Cloudflare, AWS Global Accelerator) | Good        | Fast (health-check based)
Client-side routing | Client SDK picks region based on latency probes                                | Best (measured) | Immediate
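The client-side routing row can be sketched as a region picker. This is a toy, not a real SDK: the probe results are assumed inputs (a real client would probe each endpoint periodically and mark failures as None).

```python
# Client-side routing: given probe results per region (RTT in ms, or None
# if the probe failed), pick the healthy region with the lowest latency.
def pick_region(probe_ms: dict) -> str:
    """Return the healthy region with the lowest measured RTT."""
    healthy = {r: ms for r, ms in probe_ms.items() if ms is not None}
    if not healthy:
        raise RuntimeError("no healthy region reachable")
    return min(healthy, key=healthy.get)

# us-east's probe failed, so the client immediately falls back to eu-west:
choice = pick_region({"us-east": None, "eu-west": 95.0, "apac-east": 210.0})
```

This is why the table lists client-side failover as "Immediate": the switch happens on the very next request, with no DNS TTL or BGP convergence in the path.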

6. Data Placement & Sovereignty

GDPR and Data Sovereignty

GDPR Rule: EU personal data must be processed in compliance with EU law.
           In practice: store EU user data in EU regions.

Implementation:

  User registers from Germany
    -> user.home_region = EU-WEST
    -> PII stored only in EU-WEST
    -> Aggregated/anonymized data may be replicated globally

  Data classification:
    PII (name, email, address)        -> restricted to home region
    Non-PII (product views, clicks)   -> can replicate globally
    Aggregated metrics                 -> can replicate globally
Data Type             | Can Replicate Globally?                  | Notes
--------------------- | ---------------------------------------- | -----
PII (personal data)   | No (unless compliant transfer mechanism) | GDPR, CCPA, LGPD
Financial data        | Jurisdiction-dependent                   | PCI DSS, local finance laws
Healthcare data       | No                                       | HIPAA (US), local equivalents
Anonymized/aggregated | Yes                                      | No longer personal data
Application config    | Yes                                      | Non-sensitive
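The classification above can be enforced as a replication policy check. A hypothetical sketch under the simplest possible rule: globally replicable classes go everywhere, everything else stays in the home region (class names and the function are made up for illustration).

```python
# Replication policy derived from the data classification table:
# only non-personal data classes may leave the home region.
GLOBALLY_REPLICABLE = {"non_pii", "aggregated", "config"}

def allowed_regions(data_class: str, home: str, all_regions: list) -> list:
    """Regions where a record of this class may be stored."""
    if data_class in GLOBALLY_REPLICABLE:
        return list(all_regions)
    return [home]  # PII, financial, healthcare: restricted to home region
```

In practice the replication pipeline would consult this policy per table or per column, so a single user row can split into a home-region-only PII part and a globally replicated behavioral part.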

7. Designing for Regional Failure

Region US-EAST goes down:

  Before failure:
    US users -> US-EAST
    EU users -> EU-WEST
    APAC users -> APAC-EAST

  After failure:
    US users -> EU-WEST (failover, +80ms latency)
    EU users -> EU-WEST (unchanged)
    APAC users -> APAC-EAST (unchanged)

  Requirements for successful failover:
    1. Health checks detect US-EAST failure (seconds)
    2. DNS/LB routes US traffic to EU-WEST (seconds to minutes)
    3. EU-WEST has capacity to handle US + EU traffic
    4. EU-WEST has sufficiently recent data replica
    5. Write path redirected or queued

Capacity Planning for Failover

Rule of thumb: Each region should have 50% headroom

  Normal load per region:  1000 QPS
  Capacity per region:     1500 QPS  (50% headroom)
  
  When one of 3 regions fails:
    Surviving regions each absorb ~500 extra QPS
    New load: 1500 QPS each (at capacity, but functional)

Alternative: N+1 regions
  3 regions needed for normal traffic
  Deploy 4 regions (1 extra for failover capacity)
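The headroom arithmetic above generalizes: with N regions at load L each and one region lost, each survivor absorbs L/(N-1) extra QPS. A quick worked check:

```python
# Per-region load after one of n_regions fails, assuming traffic from the
# failed region spreads evenly across the survivors.
def post_failover_load(normal_qps: float, n_regions: int) -> float:
    """QPS each surviving region must handle."""
    return normal_qps + normal_qps / (n_regions - 1)

# 3 regions at 1000 QPS each, provisioned for 1500 QPS (50% headroom):
load = post_failover_load(1000, 3)  # 1000 + 1000/2 = 1500.0: at capacity
```

Note the rule of thumb is specific to 3 regions: with only 2 regions the survivor takes 100% extra load, so it would need 100% headroom, while with 4 regions 34% headroom suffices.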

8. Traffic Management During Failover

Failover Timeline:

  t=0:    US-EAST health checks fail
  t=5s:   Health check confirms failure (3 consecutive failures)
  t=10s:  Global LB removes US-EAST from rotation
  t=10s:  US traffic routes to EU-WEST (nearest healthy)
  t=10s+: EU-WEST auto-scales to handle increased load

  DNS-based failover:
    Slower: DNS TTL must expire (30-300s)
    Mitigation: Use low TTL (30s) for critical services
    Or: Use anycast/global LB instead of DNS

  Client-side failover:
    Fastest: Client detects failure, switches endpoint
    Requires: Client SDK with failover logic

9. Consistency Trade-offs in Multi-Region

The spectrum:

  Strong Consistency                              Eventual Consistency
  |                                                              |
  Spanner (sync cross-region)    Bounded staleness    DynamoDB Global Tables
  Slow writes, no conflicts      Tunable              Fast writes, conflicts possible
Model                | Write Latency | Read Freshness                   | Example
-------------------- | ------------- | -------------------------------- | -------
Strong (synchronous) | 100-300ms     | Always fresh                     | Google Spanner
Bounded staleness    | 50-100ms      | Stale by at most X seconds       | CockroachDB (follower reads)
Causal consistency   | 10-50ms       | Causally related reads are fresh | MongoDB sessions
Eventual             | 1-10ms        | May read stale data              | DynamoDB Global Tables

Interview Tip

Google Spanner achieves strong consistency across regions using TrueTime (GPS + atomic clocks for globally synchronized timestamps). This is hardware-dependent and unique to Google's infrastructure. CockroachDB approximates this with software-based uncertainty intervals. Most systems use eventual consistency because cross-region synchronous writes are too slow.


10. Cost Considerations

Cost Factor   | Impact
------------- | ------
Data transfer | Cross-region bandwidth: $0.02-0.09/GB (AWS). Replication generates significant transfer costs.
Compute       | Running full application stack in 3+ regions: 3x compute cost.
Storage       | Full replicas in each region: 3x storage cost.
Database      | Multi-region databases (Spanner, CockroachDB) command premium pricing.
Networking    | Global load balancers, dedicated interconnects.

Cost Optimization Strategies

1. Replicate only what's needed
   - Full DB replica: expensive
   - Cache popular data in remote regions: cheaper
   - CDN for static content: cheapest

2. Tiered approach
   - Primary region: full stack
   - Secondary regions: read replicas + cache + CDN
   - Failover region: minimal standby, scale on demand

3. Reserved capacity in primary, spot/on-demand in secondary

4. Compress replication traffic (50-80% reduction)
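Strategies 1 and 4 can be combined into a back-of-envelope cost estimate. All inputs here are illustrative assumptions (a mid-range $0.05/GB rate and 70% compression, i.e. compressed size is 0.3x the raw size):

```python
# Rough monthly cost of cross-region replication traffic.
def monthly_transfer_cost(gb_per_day: float, n_replica_regions: int,
                          usd_per_gb: float = 0.05,
                          compression_ratio: float = 0.3) -> float:
    """compression_ratio = compressed size / raw size."""
    gb_per_month = gb_per_day * 30 * n_replica_regions * compression_ratio
    return gb_per_month * usd_per_gb

# 100 GB/day of changes replicated to 2 remote regions, compressed:
cost = monthly_transfer_cost(100, 2)  # roughly $90/month
```

Running the same numbers without compression (ratio 1.0) gives about $300/month, which is the 50-80% reduction claim made concrete.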

11. Real-World Examples

Google Spanner

- Globally distributed, strongly consistent
- Uses TrueTime (GPS + atomic clocks) for global ordering
- Synchronous replication across regions
- Write latency: 10-100ms (within continent), 100-300ms (cross-continent)
- Used for: Google Ads, Google Play
- Externally consistent: strongest possible guarantee

CockroachDB

- Open-source Spanner-inspired database
- Software-based uncertainty intervals (no special hardware)
- Serializable isolation across regions
- Survival goals: region-level failure tolerance
- Follower reads: bounded-staleness reads from any region

DynamoDB Global Tables

- Active-active multi-region
- Eventual consistency for cross-region reads
- LWW conflict resolution (last writer wins)
- Replication lag: typically < 1 second
- Simple setup: enable Global Tables, select regions
- Trade-off: no strong consistency across regions

Netflix

- Active-active across 3 AWS regions (us-east, us-west, eu-west)
- Each region serves its local traffic
- Regional failure: traffic drains to other regions
- Zuul (API gateway) handles routing
- EVCache (Memcached-based) replicated across regions
- Cassandra for multi-region data with eventual consistency

12. Architecture Pattern: Multi-Region with Regional Data Ownership

                     +-------------------+
                     |  Global DNS /     |
                     |  GeoDNS (Route53) |
                     +--------+----------+
                              |
            +-----------------+-----------------+
            |                 |                 |
    +-------v-------+ +------v------+ +--------v------+
    | US-EAST       | | EU-WEST     | | APAC-EAST     |
    +---------------+ +-------------+ +---------------+
    | API Gateway   | | API Gateway | | API Gateway   |
    | App Servers   | | App Servers | | App Servers   |
    | Cache (Redis) | | Cache       | | Cache         |
    | DB Primary    | | DB Primary  | | DB Primary    |
    |  (US users)   | |  (EU users) | |  (APAC users) |
    | DB Replica    | | DB Replica  | | DB Replica    |
    |  (EU, APAC)   | |  (US, APAC) | |  (US, EU)     |
    +-------+-------+ +------+------+ +-------+------+
            |                |                 |
            +----------------+-----------------+
                   Cross-region async replication
                   (replicas are read-only)

    Write path: user's home region (strong consistency)
    Read path:  local region replica (eventual consistency)
    Failover:   promote replica to primary in surviving region

13. Key Trade-offs Discussion

Decision            | Option A                       | Option B
------------------- | ------------------------------ | --------
Topology            | Active-passive (simpler)       | Active-active (lower latency, costlier)
Replication         | Sync (strong, slow)            | Async (fast, risks data loss)
Conflict resolution | LWW (simple, data loss)        | CRDTs or app-level (complex, no data loss)
Data placement      | Full replication (simple)      | Partitioned by region (compliant, complex routing)
Failover            | DNS-based (slow, simple)       | Global LB / anycast (fast, expensive)
Consistency         | Strong everywhere (slow writes)| Eventual (fast, conflicts)

14. Interview Checklist

  • Explained motivations: latency, availability, compliance
  • Compared active-passive vs active-active with clear trade-offs
  • Discussed replication: sync vs async with cross-region latency numbers
  • Covered conflict resolution strategies (LWW, CRDT, conflict-free-by-design)
  • Designed global load balancing (GeoDNS, anycast, global LB)
  • Addressed data sovereignty (GDPR, data classification)
  • Planned for regional failure (capacity headroom, failover timeline)
  • Discussed consistency spectrum (Spanner to DynamoDB)
  • Mentioned cost considerations and optimization strategies
  • Referenced real-world systems (Spanner, CockroachDB, DynamoDB Global Tables)

15. Resources

  • Book: "Designing Data-Intensive Applications" (Kleppmann) -- Chapters 5, 9
  • Google Spanner Paper -- "Spanner: Google's Globally Distributed Database" (OSDI 2012)
  • CockroachDB Architecture Docs -- cockroachlabs.com/docs/architecture
  • AWS Multi-Region Architectures -- AWS Well-Architected whitepaper
  • Netflix Tech Blog -- "Active-Active for Multi-Regional Resiliency"
  • YouTube: Martin Kleppmann -- "Distributed Systems" lecture (Cambridge)
  • Paper: "TAO: Facebook's Distributed Data Store for the Social Graph" (ATC 2013)
