Scaling Strategies
Why This Matters
"How does this scale to 10x/100x traffic?" comes up in every FAANG system design round. You need a toolbox of scaling strategies and to know when to apply each one.
Vertical vs Horizontal Scaling
Vertical Scaling (Scale Up)
- Bigger machine: more CPU, RAM, SSD, network
- Pros: Simple, no code changes, no distributed complexity
- Cons: Hardware limits, single point of failure, expensive at top end
- When: Small-medium scale, quick wins, databases (before sharding)
Horizontal Scaling (Scale Out)
- More machines, distribute load
- Pros: Theoretically unlimited, cheaper commodity hardware, fault tolerant
- Cons: Distributed complexity, data partitioning, network overhead
- When: Large scale, need fault tolerance, stateless services
The Right Answer in Interviews
"Start with vertical scaling for simplicity. Move to horizontal when you hit hardware limits or need redundancy." Show that you know both, prefer simplicity, and recognize when distribution becomes necessary.
Stateless Services
Why Stateless?
Stateless services are the foundation of horizontal scaling.
Stateful (hard to scale):
Server A has user session → load balancer MUST send user to Server A
Stateless (easy to scale):
Any server can handle any request → add more servers freely
Making Services Stateless
| Move this... | To here... |
|---|---|
| Session data | Redis / external session store |
| File uploads | S3 / object storage |
| User state | Database |
| Cache state | Redis / Memcached |
| Configuration | Config service / env vars |
Stateless Design Pattern
Client → Load Balancer → Any App Server → Shared State (Redis/DB)
          (round robin)   (interchangeable)
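The pattern above can be sketched in a few lines. This is a minimal stand-in, assuming an external session store with a Redis-like get/put interface (here a plain dict, so the example is self-contained); the `SessionStore` and `handle_request` names are illustrative, not a real framework API.

```python
import json

# Hypothetical in-memory stand-in for an external session store such as Redis.
class SessionStore:
    def __init__(self):
        self._data = {}

    def get(self, session_id):
        raw = self._data.get(session_id)
        return json.loads(raw) if raw else None

    def put(self, session_id, session):
        self._data[session_id] = json.dumps(session)

# Any app server can handle any request, because session state lives
# in the shared store rather than in one server's memory.
def handle_request(store, session_id, server_name):
    session = store.get(session_id) or {"views": 0}
    session["views"] += 1
    store.put(session_id, session)
    return f"{server_name} served view #{session['views']}"

store = SessionStore()
print(handle_request(store, "s1", "server-A"))
print(handle_request(store, "s1", "server-B"))  # different server, same session
```

Because neither server keeps the session locally, the load balancer can round-robin freely and new instances can be added without sticky sessions.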
Read vs Write Scaling
Scaling Reads
                      ┌→ Read Replica 1
Client → Cache → LB ──├→ Read Replica 2
   ↓     (Redis)      └→ Read Replica 3
  CDN (static)
Techniques:
- Caching — Redis, CDN, browser cache
- Read replicas — replicate DB, route reads to replicas
- CDN — cache static/semi-static content at edge
- Denormalization — pre-compute and store query results
- Materialized views — DB-level pre-computed aggregations
- Search index — Elasticsearch for complex queries
- CQRS — separate read model optimized for queries
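The first technique in the list, caching, is most often implemented as cache-aside: check the cache, fall back to the database on a miss, then populate the cache. A minimal sketch, assuming an injected `db_query` callable and an illustrative TTL (real deployments would use Redis and its native expiry rather than this in-process dict):

```python
import time

# Cache-aside sketch: read from cache, fall through to the DB on a miss.
class CacheAside:
    def __init__(self, db_query, ttl_seconds=60):
        self._db_query = db_query
        self._ttl = ttl_seconds
        self._cache = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._cache.get(key)
        if entry and entry[1] > time.monotonic():
            return entry[0]                      # cache hit
        value = self._db_query(key)              # cache miss: read from DB
        self._cache[key] = (value, time.monotonic() + self._ttl)
        return value

calls = []
def db_query(key):
    calls.append(key)          # track how often the DB is actually hit
    return f"row:{key}"

cache = CacheAside(db_query)
cache.get("user:1")
cache.get("user:1")
assert calls == ["user:1"]     # second read served from cache, DB hit once
```

The TTL bounds staleness: a hot key is served from memory until it expires, which is what takes read load off the database.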
Scaling Writes
Client → Queue → Workers → Sharded DB
        (buffer)           (distributed writes)
Techniques:
- Sharding — partition data across multiple DB instances
- Async processing — queue writes, process in background
- Batch writes — accumulate and write in bulk
- Write-behind cache — write to cache, async flush to DB
- Event sourcing — append-only log (no updates, only inserts)
- Separate write path — CQRS with dedicated write model
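The batching technique above can be sketched as follows. This is illustrative, not a real client library: writes accumulate in a queue and are flushed as one bulk operation once a threshold is reached (here `flush` just records the batch where a real worker would issue a single bulk DB write).

```python
from collections import deque

# Batch-write sketch: producers enqueue records, batches flush in bulk.
class BatchWriter:
    def __init__(self, batch_size=3):
        self._queue = deque()
        self._batch_size = batch_size
        self.flushed = []  # batches that have been "written"

    def write(self, record):
        self._queue.append(record)
        if len(self._queue) >= self._batch_size:
            self.flush()

    def flush(self):
        if self._queue:
            batch = list(self._queue)
            self._queue.clear()
            self.flushed.append(batch)  # one bulk write instead of N small ones

w = BatchWriter(batch_size=3)
for i in range(7):
    w.write(i)
w.flush()  # drain the remainder
# → flushed batches: [[0, 1, 2], [3, 4, 5], [6]]
```

Seven individual writes become three bulk writes; production systems typically also flush on a timer so a partial batch never waits indefinitely.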
Auto-Scaling
Metrics to Scale On
| Metric | Scale When | Good For |
|---|---|---|
| CPU utilization | > 70% | Compute-bound services |
| Memory utilization | > 80% | Memory-bound services |
| Request queue depth | > N pending | IO-bound, queue-based |
| Request latency (p99) | > threshold | Latency-sensitive |
| Custom (business metric) | Varies | Queue size, active users |
Scaling Policies
- Target tracking: Maintain CPU at 50% → add/remove instances
- Step scaling: If CPU > 70% add 2, if > 90% add 5
- Scheduled scaling: Scale up before known peak (Black Friday)
- Predictive: ML-based prediction of upcoming load
Cooldown Periods
- After scaling up, wait before scaling down (avoid flapping)
- Typical: 5-10 min cooldown
- Scale up fast, scale down slowly
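The target-tracking policy plus asymmetric cooldowns can be sketched together. All thresholds and cooldown lengths below are illustrative (ticks rather than minutes), and the proportional rule is a simplification of what managed auto-scalers do:

```python
# Target-tracking sketch: keep CPU near a target, scale up fast, down slowly.
def desired_instances(current, cpu_pct, target_pct=50):
    # Proportional rule: if CPU is double the target, double the fleet.
    return max(1, round(current * cpu_pct / target_pct))

class AutoScaler:
    def __init__(self, instances=2, up_cooldown=1, down_cooldown=5):
        self.instances = instances
        self._up_cooldown = up_cooldown      # short wait after scaling up
        self._down_cooldown = down_cooldown  # longer wait before scaling down
        self._cooldown = 0

    def tick(self, cpu_pct):
        if self._cooldown > 0:               # in cooldown: no changes (anti-flap)
            self._cooldown -= 1
            return self.instances
        desired = desired_instances(self.instances, cpu_pct)
        if desired > self.instances:
            self.instances = desired         # jump straight to desired capacity
            self._cooldown = self._up_cooldown
        elif desired < self.instances:
            self.instances -= 1              # step down one instance at a time
            self._cooldown = self._down_cooldown
        return self.instances
```

Scaling up jumps to the computed capacity immediately, while scaling down removes one instance per (long) cooldown window, which is exactly the "scale up fast, scale down slowly" rule.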
Database Scaling Path
Level 1: Single DB (vertical scaling)
            ↓ hits limits
Level 2: Read replicas (scale reads)
            ↓ write bottleneck
Level 3: Caching layer (reduce DB load)
            ↓ still not enough
Level 4: Sharding (scale reads + writes + storage)
            ↓ need different access patterns
Level 5: Polyglot persistence (different DBs for different needs)
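The core of Level 4, sharding, is a routing function that maps each key to a shard deterministically. A minimal hash-mod sketch (shard names are made up; `md5` is used only because Python's built-in `hash()` is randomized per process):

```python
import hashlib

# Hash-based shard routing sketch; shard names are illustrative.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]

def shard_for(key: str) -> str:
    # Stable digest so the same key always routes to the same shard.
    digest = hashlib.md5(key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

assert shard_for("user:42") == shard_for("user:42")  # deterministic routing
```

The weakness of plain hash-mod is that changing the shard count remaps most keys; consistent hashing or pre-split ranges are the usual fixes, and are worth mentioning in an interview.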
Microservices as a Scaling Strategy
Scaling Individual Services
User Service: 10 instances (high traffic)
Payment Service: 3 instances (moderate traffic)
Report Service: 1 instance (low traffic, can scale up for batch jobs)
Each service scales independently based on its load.
Decomposition Strategies
- By business domain (DDD bounded contexts)
- By data ownership (each service owns its data)
- By scaling needs (separate CPU-bound from IO-bound)
- By team ownership (two-pizza teams)
Common Scaling Bottlenecks
| Bottleneck | Symptom | Solution |
|---|---|---|
| Database | High query latency, connection limits | Caching, replicas, sharding |
| Network | Bandwidth saturation, high latency | CDN, compression, protocol optimization |
| CPU | High CPU, slow processing | Horizontal scaling, optimize algorithms |
| Memory | OOM, swapping | Right-size instances, offload state |
| Disk I/O | High IOWAIT | SSDs, caching, reduce writes |
| External APIs | Rate limited, slow responses | Caching, circuit breaker, async |
| Single writer | One node handles all writes | Sharding, partitioning |
Scaling Patterns Summary
| Pattern | What It Scales | Complexity |
|---|---|---|
| Vertical scaling | Everything (temporarily) | Low |
| Load balancing | Requests across servers | Low |
| Read replicas | Database reads | Low |
| Caching (Redis/CDN) | Reads, latency | Low-Medium |
| Async processing | Writes, throughput | Medium |
| Sharding | Reads + Writes + Storage | High |
| CQRS | Reads + Writes independently | High |
| Microservices | Services independently | High |
| Event-driven | Decoupled throughput | High |
Resources
- 📖 DDIA Chapter 1: Reliable, Scalable, Maintainable
- 📖 "The Art of Scalability" by Abbott & Fisher
- 🔗 AWS Well-Architected: Performance Pillar
- 🎥 Gaurav Sen — System Design playlist
- 🔗 highscalability.com — Scaling case studies
Previous: 14 - Clocks & Ordering | Next: 16 - Microservices Architecture