Scaling Strategies
Why This Matters
"How does this scale to 10x/100x traffic?" comes up in every FAANG system design round. You need a toolbox of scaling strategies and to know when to apply each one.
Vertical vs Horizontal Scaling
Vertical Scaling (Scale Up)
- Bigger machine: more CPU, RAM, SSD, network
- Pros: Simple, no code changes, no distributed complexity
- Cons: Hardware limits, single point of failure, expensive at top end
- When: Small-medium scale, quick wins, databases (before sharding)
Horizontal Scaling (Scale Out)
- More machines, distribute load
- Pros: Theoretically unlimited, cheaper commodity hardware, fault tolerant
- Cons: Distributed complexity, data partitioning, network overhead
- When: Large scale, need fault tolerance, stateless services
The Right Answer in Interviews
"Start with vertical scaling for simplicity. Move to horizontal when you hit hardware limits or need redundancy." Show that you know both, prefer simplicity, and recognize when distribution becomes necessary.
Stateless Services
Why Stateless?
Stateless services are the foundation of horizontal scaling.
Stateful (hard to scale):
Server A has user session → load balancer MUST send user to Server A
Stateless (easy to scale):
Any server can handle any request → add more servers freely
Making Services Stateless
| Move this... | To here... |
|---|---|
| Session data | Redis / external session store |
| File uploads | S3 / object storage |
| User state | Database |
| Cache state | Redis / Memcached |
| Configuration | Config service / env vars |
Stateless Design Pattern
Client → Load Balancer → Any App Server → Shared State (Redis/DB)
          (round robin)   (interchangeable)
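The pattern above can be sketched in a few lines. This is a minimal stand-in, assuming an external session store with a Redis-like get/put interface (here a plain dict, so the example is self-contained); the `SessionStore` and `handle_request` names are illustrative, not a real framework API.

```python
import json

# Hypothetical in-memory stand-in for an external session store such as Redis.
class SessionStore:
    def __init__(self):
        self._data = {}

    def get(self, session_id):
        raw = self._data.get(session_id)
        return json.loads(raw) if raw else None

    def put(self, session_id, session):
        self._data[session_id] = json.dumps(session)

# Any app server can handle any request, because session state lives
# in the shared store rather than in one server's memory.
def handle_request(store, session_id, server_name):
    session = store.get(session_id) or {"views": 0}
    session["views"] += 1
    store.put(session_id, session)
    return f"{server_name} served view #{session['views']}"

store = SessionStore()
print(handle_request(store, "s1", "server-A"))
print(handle_request(store, "s1", "server-B"))  # different server, same session
```

Because neither server keeps the session locally, the load balancer can round-robin freely and new instances can be added without sticky sessions.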
Read vs Write Scaling
Scaling Reads
                      ┌→ Read Replica 1
Client → Cache → LB ──├→ Read Replica 2
   ↓     (Redis)      └→ Read Replica 3
  CDN (static)
Techniques:
- Caching — Redis, CDN, browser cache
- Read replicas — replicate DB, route reads to replicas
- CDN — cache static/semi-static content at edge
- Denormalization — pre-compute and store query results
- Materialized views — DB-level pre-computed aggregations
- Search index — Elasticsearch for complex queries
- CQRS — separate read model optimized for queries
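The first technique in the list, caching, is most often implemented as cache-aside: check the cache, fall back to the database on a miss, then populate the cache. A minimal sketch, assuming an injected `db_query` callable and an illustrative TTL (real deployments would use Redis and its native expiry rather than this in-process dict):

```python
import time

# Cache-aside sketch: read from cache, fall through to the DB on a miss.
class CacheAside:
    def __init__(self, db_query, ttl_seconds=60):
        self._db_query = db_query
        self._ttl = ttl_seconds
        self._cache = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._cache.get(key)
        if entry and entry[1] > time.monotonic():
            return entry[0]                      # cache hit
        value = self._db_query(key)              # cache miss: read from DB
        self._cache[key] = (value, time.monotonic() + self._ttl)
        return value

calls = []
def db_query(key):
    calls.append(key)          # track how often the DB is actually hit
    return f"row:{key}"

cache = CacheAside(db_query)
cache.get("user:1")
cache.get("user:1")
assert calls == ["user:1"]     # second read served from cache, DB hit once
```

The TTL bounds staleness: a hot key is served from memory until it expires, which is what takes read load off the database.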
Scaling Writes
Client → Queue → Workers → Sharded DB
        (buffer)           (distributed writes)
Techniques:
- Sharding — partition data across multiple DB instances
- Async processing — queue writes, process in background
- Batch writes — accumulate and write in bulk
- Write-behind cache — write to cache, async flush to DB
- Event sourcing — append-only log (no updates, only inserts)
- Separate write path — CQRS with dedicated write model
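The batching technique above can be sketched as follows. This is illustrative, not a real client library: writes accumulate in a queue and are flushed as one bulk operation once a threshold is reached (here `flush` just records the batch where a real worker would issue a single bulk DB write).

```python
from collections import deque

# Batch-write sketch: producers enqueue records, batches flush in bulk.
class BatchWriter:
    def __init__(self, batch_size=3):
        self._queue = deque()
        self._batch_size = batch_size
        self.flushed = []  # batches that have been "written"

    def write(self, record):
        self._queue.append(record)
        if len(self._queue) >= self._batch_size:
            self.flush()

    def flush(self):
        if self._queue:
            batch = list(self._queue)
            self._queue.clear()
            self.flushed.append(batch)  # one bulk write instead of N small ones

w = BatchWriter(batch_size=3)
for i in range(7):
    w.write(i)
w.flush()  # drain the remainder
# → flushed batches: [[0, 1, 2], [3, 4, 5], [6]]
```

Seven individual writes become three bulk writes; production systems typically also flush on a timer so a partial batch never waits indefinitely.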
Auto-Scaling
Metrics to Scale On
| Metric | Scale When | Good For |
|---|---|---|
| CPU utilization | > 70% | Compute-bound services |
| Memory utilization | > 80% | Memory-bound services |
| Request queue depth | > N pending | IO-bound, queue-based |
| Request latency (p99) | > threshold | Latency-sensitive |
| Custom (business metric) | Varies | Queue size, active users |
Scaling Policies
- Target tracking: Maintain CPU at 50% → add/remove instances
- Step scaling: If CPU > 70% add 2, if > 90% add 5
- Scheduled scaling: Scale up before known peak (Black Friday)
- Predictive: ML-based prediction of upcoming load
Cooldown Periods
- After scaling up, wait before scaling down (avoid flapping)
- Typical: 5-10 min cooldown
- Scale up fast, scale down slowly
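The target-tracking policy plus asymmetric cooldowns can be sketched together. All thresholds and cooldown lengths below are illustrative (ticks rather than minutes), and the proportional rule is a simplification of what managed auto-scalers do:

```python
# Target-tracking sketch: keep CPU near a target, scale up fast, down slowly.
def desired_instances(current, cpu_pct, target_pct=50):
    # Proportional rule: if CPU is double the target, double the fleet.
    return max(1, round(current * cpu_pct / target_pct))

class AutoScaler:
    def __init__(self, instances=2, up_cooldown=1, down_cooldown=5):
        self.instances = instances
        self._up_cooldown = up_cooldown      # short wait after scaling up
        self._down_cooldown = down_cooldown  # longer wait before scaling down
        self._cooldown = 0

    def tick(self, cpu_pct):
        if self._cooldown > 0:               # in cooldown: no changes (anti-flap)
            self._cooldown -= 1
            return self.instances
        desired = desired_instances(self.instances, cpu_pct)
        if desired > self.instances:
            self.instances = desired         # jump straight to desired capacity
            self._cooldown = self._up_cooldown
        elif desired < self.instances:
            self.instances -= 1              # step down one instance at a time
            self._cooldown = self._down_cooldown
        return self.instances
```

Scaling up jumps to the computed capacity immediately, while scaling down removes one instance per (long) cooldown window, which is exactly the "scale up fast, scale down slowly" rule.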
Database Scaling Path
Level 1: Single DB (vertical scaling)
            ↓ hits limits
Level 2: Read replicas (scale reads)
            ↓ write bottleneck
Level 3: Caching layer (reduce DB load)
            ↓ still not enough
Level 4: Sharding (scale reads + writes + storage)
            ↓ need different access patterns
Level 5: Polyglot persistence (different DBs for different needs)
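The core of Level 4, sharding, is a routing function that maps each key to a shard deterministically. A minimal hash-mod sketch (shard names are made up; `md5` is used only because Python's built-in `hash()` is randomized per process):

```python
import hashlib

# Hash-based shard routing sketch; shard names are illustrative.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]

def shard_for(key: str) -> str:
    # Stable digest so the same key always routes to the same shard.
    digest = hashlib.md5(key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

assert shard_for("user:42") == shard_for("user:42")  # deterministic routing
```

The weakness of plain hash-mod is that changing the shard count remaps most keys; consistent hashing or pre-split ranges are the usual fixes, and are worth mentioning in an interview.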
Microservices as a Scaling Strategy
Scaling Individual Services
User Service: 10 instances (high traffic)
Payment Service: 3 instances (moderate traffic)
Report Service: 1 instance (low traffic, can scale up for batch jobs)
Each service scales independently based on its load.
Decomposition Strategies
- By business domain (DDD bounded contexts)
- By data ownership (each service owns its data)
- By scaling needs (separate CPU-bound from IO-bound)
- By team ownership (two-pizza teams)
Common Scaling Bottlenecks
| Bottleneck | Symptom | Solution |
|---|---|---|
| Database | High query latency, connection limits | Caching, replicas, sharding |
| Network | Bandwidth saturation, high latency | CDN, compression, protocol optimization |
| CPU | High CPU, slow processing | Horizontal scaling, optimize algorithms |
| Memory | OOM, swapping | Right-size instances, offload state |
| Disk I/O | High IOWAIT | SSDs, caching, reduce writes |
| External APIs | Rate limited, slow responses | Caching, circuit breaker, async |
| Single writer | One node handles all writes | Sharding, partitioning |
Scaling Patterns Summary
| Pattern | What It Scales | Complexity |
|---|---|---|
| Vertical scaling | Everything (temporarily) | Low |
| Load balancing | Requests across servers | Low |
| Read replicas | Database reads | Low |
| Caching (Redis/CDN) | Reads, latency | Low-Medium |
| Async processing | Writes, throughput | Medium |
| Sharding | Reads + Writes + Storage | High |
| CQRS | Reads + Writes independently | High |
| Microservices | Services independently | High |
| Event-driven | Decoupled throughput | High |
Resources
- 📖 DDIA Chapter 1: Reliable, Scalable, Maintainable
- 📖 "The Art of Scalability" by Abbott & Fisher
- 🔗 AWS Well-Architected: Performance Pillar
- 🎥 Gaurav Sen — System Design playlist
- 🔗 highscalability.com — Scaling case studies
Previous: 14 - Clocks & Ordering | Next: 16 - Microservices Architecture