Scaling Strategies

Why This Matters

"How does this scale to 10x/100x traffic?" is asked in every FAANG system design round. You need a toolbox of scaling strategies and know when to apply each.


Vertical vs Horizontal Scaling

Vertical Scaling (Scale Up)

  • Bigger machine: more CPU, RAM, SSD, network
  • Pros: Simple, no code changes, no distributed complexity
  • Cons: Hardware limits, single point of failure, expensive at top end
  • When: Small-medium scale, quick wins, databases (before sharding)

Horizontal Scaling (Scale Out)

  • More machines, distribute load
  • Pros: Theoretically unlimited, cheaper commodity hardware, fault tolerant
  • Cons: Distributed complexity, data partitioning, network overhead
  • When: Large scale, need fault tolerance, stateless services

The Right Answer in Interviews

"Start with vertical scaling for simplicity. Move to horizontal when you hit limits or need redundancy." Show you know both, prefer simplicity, but know when to distribute.


Stateless Services

Why Stateless?

Stateless services are the foundation of horizontal scaling.

Stateful (hard to scale):
  Server A has user session → load balancer MUST send user to Server A

Stateless (easy to scale):
  Any server can handle any request → add more servers freely

Making Services Stateless

| Move this | To here |
|---|---|
| Session data | Redis / external session store |
| File uploads | S3 / object storage |
| User state | Database |
| Cache state | Redis / Memcached |
| Configuration | Config service / env vars |

Stateless Design Pattern

Client → Load Balancer → Any App Server → Shared State (Redis/DB)
         (round robin)    (interchangeable)
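The pattern above can be sketched in a few lines. A plain dict stands in for the shared Redis store, and the handler names are illustrative, not a real framework API; the point is that the response depends only on shared state, never on which server ran it.

```python
# Stateless request handler: all session state lives in an external store
# (a dict stands in for Redis here), so any server can serve any request.
SESSION_STORE = {}  # stands in for Redis: session_id -> session data

def handle_request(server_id: str, session_id: str) -> dict:
    session = SESSION_STORE.setdefault(session_id, {"views": 0})
    session["views"] += 1
    # The response is derived from shared state, not server-local memory.
    return {"served_by": server_id, "views": session["views"]}

# The load balancer can route the same user to different servers freely:
r1 = handle_request("server-a", "sess-42")  # views: 1
r2 = handle_request("server-b", "sess-42")  # views: 2, state survived the hop
```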

Read vs Write Scaling

Scaling Reads

                    ┌→ Read Replica 1
Client → Cache → LB ├→ Read Replica 2
         (Redis)     └→ Read Replica 3
              ↓
           CDN (static)

Techniques:

  1. Caching — Redis, CDN, browser cache
  2. Read replicas — replicate DB, route reads to replicas
  3. CDN — cache static/semi-static content at edge
  4. Denormalization — pre-compute and store query results
  5. Materialized views — DB-level pre-computed aggregations
  6. Search index — Elasticsearch for complex queries
  7. CQRS — separate read model optimized for queries
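Technique 1 (caching) is usually implemented cache-aside: check the cache, fall back to the database on a miss, and populate the cache on the way out. The sketch below uses in-memory dicts as stand-ins for Redis and the database, with simplified TTL handling.

```python
import time

DB = {"user:1": {"name": "Ada"}}           # stands in for the primary database
CACHE: dict[str, tuple[float, dict]] = {}  # key -> (expiry timestamp, value)
TTL_SECONDS = 60

def get_user(key: str) -> dict:
    entry = CACHE.get(key)
    if entry and entry[0] > time.time():
        return entry[1]                    # cache hit: no DB round trip
    value = DB[key]                        # cache miss: read from the database
    CACHE[key] = (time.time() + TTL_SECONDS, value)
    return value

first = get_user("user:1")   # miss: populates the cache
second = get_user("user:1")  # hit: served from the cache
```

The TTL bounds staleness; shorter TTLs mean fresher data but more DB load.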

Scaling Writes

Client → Queue → Workers → Sharded DB
         (buffer)          (distributed writes)

Techniques:

  1. Sharding — partition data across multiple DB instances
  2. Async processing — queue writes, process in background
  3. Batch writes — accumulate and write in bulk
  4. Write-behind cache — write to cache, async flush to DB
  5. Event sourcing — append-only log (no updates, only inserts)
  6. Separate write path — CQRS with dedicated write model
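Techniques 2 and 3 combine naturally: producers enqueue writes and return immediately, while a background worker drains the queue and flushes rows in bulk. In this sketch a list stands in for a real bulk-insert call, and the batch size is an arbitrary illustrative choice.

```python
from queue import Queue

WRITE_QUEUE: Queue = Queue()
DB_ROWS: list[dict] = []
BATCH_SIZE = 100

def enqueue_write(row: dict) -> None:
    WRITE_QUEUE.put(row)          # fast path: the caller returns immediately

def flush_batch() -> int:
    """Drain up to BATCH_SIZE rows and write them in one bulk operation."""
    batch = []
    while not WRITE_QUEUE.empty() and len(batch) < BATCH_SIZE:
        batch.append(WRITE_QUEUE.get())
    DB_ROWS.extend(batch)         # stands in for one bulk INSERT
    return len(batch)

for i in range(250):
    enqueue_write({"id": i})
flushed = [flush_batch() for _ in range(3)]  # three bulk writes: 100, 100, 50
```

The trade-off: writes become eventually durable, so a crashed worker can lose queued rows unless the queue itself is persistent.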

Auto-Scaling

Metrics to Scale On

| Metric | Scale When | Good For |
|---|---|---|
| CPU utilization | > 70% | Compute-bound services |
| Memory utilization | > 80% | Memory-bound services |
| Request queue depth | > N pending | IO-bound, queue-based |
| Request latency (p99) | > threshold | Latency-sensitive |
| Custom (business metric) | Varies | Queue size, active users |

Scaling Policies

  • Target tracking: Maintain CPU at 50% → add/remove instances
  • Step scaling: If CPU > 70% add 2, if > 90% add 5
  • Scheduled scaling: Scale up before known peak (Black Friday)
  • Predictive: ML-based prediction of upcoming load
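Target tracking reduces to simple arithmetic: scale the current capacity by the ratio of the observed metric to the target. The sketch below mirrors this conceptually (actual cloud-provider behavior adds smoothing and cooldowns); the min/max bounds are illustrative defaults.

```python
import math

def desired_instances(current: int, observed_cpu: float, target_cpu: float = 50.0,
                      min_size: int = 1, max_size: int = 100) -> int:
    """Desired capacity = current * (observed metric / target), clamped to bounds."""
    desired = math.ceil(current * observed_cpu / target_cpu)
    return max(min_size, min(max_size, desired))

scale_out = desired_instances(10, 80.0)  # 10 * 80/50 = 16 instances
scale_in = desired_instances(10, 30.0)   # 10 * 30/50 = 6 instances
```

Note the asymmetry the cooldown section below recommends: you would apply the scale-out result immediately but delay the scale-in result to avoid flapping.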

Cooldown Periods

  • After scaling up, wait before scaling down (avoid flapping)
  • Typical: 5-10 min cooldown
  • Scale up fast, scale down slowly

Database Scaling Path

Level 1: Single DB (vertical scaling)
  ↓ hits limits
Level 2: Read replicas (scale reads)
  ↓ write bottleneck
Level 3: Caching layer (reduce DB load)
  ↓ still not enough
Level 4: Sharding (scale reads + writes + storage)
  ↓ need different access patterns
Level 5: Polyglot persistence (different DBs for different needs)
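At Level 4, the core mechanism is routing each key to one of N shards. A minimal sketch, assuming hash-based sharding: use a stable hash (not Python's built-in `hash()`, which is randomized per process) so every server routes a given key to the same shard.

```python
import hashlib

NUM_SHARDS = 4

def shard_for(key: str) -> int:
    """Route a shard key to one of NUM_SHARDS database instances."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# All reads and writes for a given user land on the same shard:
s1 = shard_for("user:1")
s2 = shard_for("user:1")

# Keys spread roughly evenly across all shards:
shards_hit = {shard_for(f"user:{i}") for i in range(1000)}
```

Note that plain modulo sharding reshuffles most keys when NUM_SHARDS changes; consistent hashing is the usual fix for that, at the cost of more complexity.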

Microservices as a Scaling Strategy

Scaling Individual Services

User Service: 10 instances (high traffic)
Payment Service: 3 instances (moderate traffic)
Report Service: 1 instance (low traffic, can scale up for batch jobs)

Each service scales independently based on its load.

Decomposition Strategies

  • By business domain (DDD bounded contexts)
  • By data ownership (each service owns its data)
  • By scaling needs (separate CPU-bound from IO-bound)
  • By team ownership (two-pizza teams)

Common Scaling Bottlenecks

| Bottleneck | Symptom | Solution |
|---|---|---|
| Database | High query latency, connection limits | Caching, replicas, sharding |
| Network | Bandwidth saturation, high latency | CDN, compression, protocol optimization |
| CPU | High CPU, slow processing | Horizontal scaling, optimize algorithms |
| Memory | OOM, swapping | Right-size instances, offload state |
| Disk I/O | High IOWAIT | SSDs, caching, reduce writes |
| External APIs | Rate limited, slow responses | Caching, circuit breaker, async |
| Single writer | One node handles all writes | Sharding, partitioning |

Scaling Patterns Summary

| Pattern | What It Scales | Complexity |
|---|---|---|
| Vertical scaling | Everything (temporarily) | Low |
| Load balancing | Requests across servers | Low |
| Read replicas | Database reads | Low |
| Caching (Redis/CDN) | Reads, latency | Low-Medium |
| Async processing | Writes, throughput | Medium |
| Sharding | Reads + writes + storage | High |
| CQRS | Reads + writes independently | High |
| Microservices | Services independently | High |
| Event-driven | Decoupled throughput | High |

Previous: 14 - Clocks & Ordering | Next: 16 - Microservices Architecture