17 - Rate Limiting & Throttling
Previous: 16 - Microservices Architecture | Next: 18 - Unique ID Generation
Why Rate Limiting Matters
Rate limiting protects services from abuse, ensures fair resource allocation, and prevents cascading failures. Every production system at scale needs it.
Key motivations:
- Prevent denial-of-service (intentional or accidental)
- Enforce usage quotas and billing tiers
- Protect downstream dependencies from overload
- Ensure fair access across tenants in multi-tenant systems
- Smooth out traffic spikes to maintain predictable latency
Core Algorithms
1. Token Bucket
The most widely deployed algorithm (used by AWS, Stripe, and most API gateways).
```
Bucket capacity: B tokens
Refill rate:     R tokens/sec

On request:
    if tokens >= 1:
        tokens -= 1
        ALLOW
    else:
        REJECT (429)
```

```
+-----------+
|  Bucket   |   Capacity = 10
| [////  ]  |   Current  = 6 tokens
+-----------+
      ^
      | +R tokens/sec (refill)
```
| Property | Value |
|---|---|
| Burst handling | Yes -- allows short bursts up to bucket capacity |
| Memory | O(1) per user -- just store count + last refill time |
| Precision | Good |
| Starvation | No -- tokens refill continuously |
Implementation sketch (Redis + Lua):
```lua
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

local data = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(data[1]) or capacity
local last_refill = tonumber(data[2]) or now

local elapsed = now - last_refill
tokens = math.min(capacity, tokens + elapsed * refill_rate)

if tokens >= 1 then
  tokens = tokens - 1
  redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
  redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) * 2)
  return 1  -- ALLOW
else
  redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
  return 0  -- REJECT
end
```
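The same logic can be sketched as a single-process Python class — an illustrative stand-in for the Redis sketch above when no shared state is needed. The `clock` parameter is an assumption added here purely to make the refill logic testable:

```python
import time

class TokenBucket:
    """Single-process token bucket; for distributed limits, shared state is needed."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity        # B: max burst size
        self.refill_rate = refill_rate  # R: tokens added per second
        self.tokens = float(capacity)   # start full
        self.clock = clock
        self.last_refill = clock()

    def allow(self):
        now = self.clock()
        elapsed = now - self.last_refill
        self.last_refill = now
        # Continuous refill, capped at bucket capacity
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True   # ALLOW
        return False      # REJECT -> 429
```

Note how a full bucket allows a burst of B back-to-back requests, after which the sustained rate is capped at R per second.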
2. Leaky Bucket
Requests enter a FIFO queue and are processed at a fixed rate. Excess requests overflow.
```
Incoming --> [Queue (size B)] --> Process at rate R
                   |
                   v
             overflow (drop)
```
| Property | Value |
|---|---|
| Burst handling | No -- smooths all traffic to constant rate |
| Memory | O(B) for the queue |
| Use case | Traffic shaping, smoothing bursty workloads |
Trade-off vs Token Bucket: Leaky bucket provides a perfectly smooth output rate but penalizes legitimate burst traffic. Token bucket is more forgiving.
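A minimal queue-based sketch of the leaky bucket, under the assumption that timestamps are passed in explicitly for clarity (a real shaper would drain on a timer and start its leak clock at construction time):

```python
from collections import deque

class LeakyBucket:
    """Queue of pending requests, drained at a fixed rate; excess overflows."""

    def __init__(self, capacity, rate):
        self.queue = deque()
        self.capacity = capacity   # B: max queued requests
        self.rate = rate           # R: requests processed per second
        self.last_leak = 0.0       # leak clock starts at 0 for simplicity

    def offer(self, request, now):
        self._leak(now)
        if len(self.queue) < self.capacity:
            self.queue.append(request)
            return True   # queued
        return False      # overflow -> drop

    def _leak(self, now):
        # Process floor(elapsed * rate) requests since the last leak
        leaked = int((now - self.last_leak) * self.rate)
        if leaked > 0:
            for _ in range(min(leaked, len(self.queue))):
                self.queue.popleft()
            self.last_leak = now
```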
3. Fixed Window Counter
Divide time into fixed windows (e.g., 1-minute intervals). Count requests per window.
```
  Window 1        Window 2        Window 3
[12:00-12:01]   [12:01-12:02]   [12:02-12:03]
Count: 97/100   Count: 45/100   Count: 0/100
```
| Property | Value |
|---|---|
| Memory | O(1) per user |
| Precision | Low -- boundary problem |
| Simplicity | Highest |
Boundary problem: A user can send 100 requests at 12:00:59 and 100 more at 12:01:01, getting 200 requests in 2 seconds despite a 100/min limit.
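A toy sketch (single key, explicit clock, both assumptions for illustration) that makes the boundary problem concrete:

```python
class FixedWindowCounter:
    """Count requests per fixed time window; reset implicitly at each boundary."""

    def __init__(self, limit, window_sec):
        self.limit = limit
        self.window = window_sec
        self.counts = {}   # window start time -> request count

    def allow(self, now):
        # Align to the start of the current fixed window
        window_start = int(now // self.window) * self.window
        count = self.counts.get(window_start, 0)
        if count >= self.limit:
            return False
        self.counts[window_start] = count + 1
        return True
```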
4. Sliding Window Log
Store the timestamp of every request. Count requests within the trailing window.
```
Window = 60s, Limit = 100
Timestamps: [t-55, t-42, t-30, t-18, t-5, t-2, t]
                                              ^-- new request
Count timestamps where ts > (now - 60s)
```
| Property | Value |
|---|---|
| Memory | O(N) per user -- stores every timestamp |
| Precision | Exact |
| Cost | High memory at scale |
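The log-based approach fits in a few lines; this sketch (explicit clock assumed) shows why memory is O(N) — every accepted request leaves a timestamp behind:

```python
from collections import deque

class SlidingWindowLog:
    """Exact rate limiting by keeping a timestamp per accepted request."""

    def __init__(self, limit, window_sec):
        self.limit = limit
        self.window = window_sec
        self.log = deque()   # timestamps of accepted requests, oldest first

    def allow(self, now):
        # Evict timestamps that have aged out of the trailing window
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False
```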
5. Sliding Window Counter (Hybrid)
Combines fixed window counting with weighted interpolation. Best balance of accuracy and efficiency.
```
Previous window count: 84
Current window count:  25
Elapsed fraction of current window: 0.3  (so previous window weight = 0.7)

Weighted count = 84 * 0.7 + 25 = 83.8
If weighted count >= 100 --> REJECT  (here 83.8 < 100, so ALLOW)
```
| Property | Value |
|---|---|
| Memory | O(1) per user -- two counters |
| Precision | Very good (not exact, but close) |
| Use case | Production systems needing accuracy + efficiency |
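A sketch of the hybrid counter using the worked example above (single key and explicit clock are simplifying assumptions):

```python
class SlidingWindowCounter:
    """Approximate sliding window: weight the previous fixed window's count
    by the fraction of it still inside the trailing window."""

    def __init__(self, limit, window_sec):
        self.limit = limit
        self.window = window_sec
        self.counts = {}   # fixed-window index -> request count

    def allow(self, now):
        idx = int(now // self.window)
        prev = self.counts.get(idx - 1, 0)
        curr = self.counts.get(idx, 0)
        # Fraction of the current fixed window already elapsed
        frac = (now % self.window) / self.window
        weighted = prev * (1 - frac) + curr
        if weighted >= self.limit:
            return False
        self.counts[idx] = curr + 1
        return True
```

Only the two most recent window counters are ever read, so old entries can be evicted lazily, keeping memory O(1) per key.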
Algorithm Comparison
| Algorithm | Memory | Accuracy | Burst | Complexity | Production Use |
|---|---|---|---|---|---|
| Token Bucket | O(1) | Good | Yes | Low | AWS, Stripe, most gateways |
| Leaky Bucket | O(B) | Good | No | Medium | Traffic shaping |
| Fixed Window | O(1) | Low | Boundary bug | Lowest | Simple internal services |
| Sliding Log | O(N) | Exact | N/A | Medium | When precision is critical |
| Sliding Counter | O(1) | Very Good | Partial | Low | Cloudflare, production APIs |
Distributed Rate Limiting
Single-node rate limiting breaks when you scale horizontally. Two main approaches:
Centralized (Redis-based)
```
Client --> LB --> [Server 1] --+
                  [Server 2] --+--> Redis (shared state)
                  [Server 3] --+
```
Pros: Globally consistent, accurate counts. Cons: Redis becomes a dependency; adds ~1ms latency per request.
Redis implementation patterns:
- `INCR` + `EXPIRE` for fixed window
- Lua scripts for token bucket (atomic operations)
- Sorted sets for sliding window log
- Redis Cluster for HA
Local + Sync (Approximate)
Each node maintains local counters and periodically syncs with peers.
```
[Server 1: local=30] --sync--> [Server 2: local=25]
          \                         /
           +--> Gossip / broadcast <+
```
Pros: No single point of failure, low latency. Cons: Approximate -- can briefly exceed limits during sync gaps.
When to use which:
- Centralized: Billing-critical limits, strict quotas, API monetization
- Local + Sync: Best-effort protection, DDoS mitigation, internal services
Race Condition Handling
```
-- WRONG (race condition):
count = GET(key)
if count < limit:
    SET(key, count + 1)   -- another thread may have incremented in between

-- RIGHT (atomic):
count = INCR(key)
if count == 1:
    EXPIRE(key, window_size)
if count > limit:
    REJECT
```
Always use atomic operations (Lua scripts or Redis transactions).
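The difference can be mimicked in-process, with a lock standing in for Redis's single-threaded command execution (all names here are hypothetical, for illustration only):

```python
import threading

class AtomicCounter:
    """In-process stand-in for Redis INCR: the lock plays the role of
    Redis executing each command atomically."""

    def __init__(self):
        self._counts = {}
        self._lock = threading.Lock()

    def incr(self, key):
        with self._lock:
            self._counts[key] = self._counts.get(key, 0) + 1
            return self._counts[key]

def allow(counter, key, limit):
    # Increment-then-check: the check uses the value returned by the
    # atomic increment, so there is no window for a lost update.
    return counter.incr(key) <= limit
```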
Rate Limiting at Different Layers
```
Client --> CDN/Edge --> API Gateway --> Application --> Database
   |           |             |              |              |
Client-side  Edge rate    Gateway rate   App-level     Connection
throttling   limiting     limiting       throttling    pooling
```
| Layer | Tool | Granularity | Use Case |
|---|---|---|---|
| Client | Exponential backoff | Per client | Prevent self-DoS |
| CDN/Edge | Cloudflare, AWS WAF | IP, geo | DDoS, bot protection |
| API Gateway | Kong, Envoy, NGINX | API key, route | Quota enforcement |
| Application | Custom middleware | User, tenant, endpoint | Business rules |
| Database | Connection pool limits | Per service | Protect DB |
HTTP Headers and Status Codes
Standard Response Headers
```http
HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1672531260

{
  "error": "rate_limit_exceeded",
  "message": "Rate limit of 100 requests per minute exceeded.",
  "retry_after": 30
}
```
| Header / Status | Purpose |
|---|---|
| `429 Too Many Requests` | HTTP status code for rate limiting |
| `Retry-After` | Seconds (or HTTP date) until the client should retry |
| `X-RateLimit-Limit` | Maximum requests allowed in the window |
| `X-RateLimit-Remaining` | Requests remaining in the current window |
| `X-RateLimit-Reset` | Unix timestamp when the window resets |
Client-Side Best Practices
```python
import time
import random
import requests

def request_with_backoff(url, max_retries=5):
    for attempt in range(max_retries):
        response = requests.get(url)
        if response.status_code != 429:
            return response
        # Honor the server's Retry-After hint, plus jitter
        retry_after = int(response.headers.get('Retry-After', 1))
        jitter = random.uniform(0, 0.5)
        time.sleep(retry_after + jitter)
    raise Exception("Max retries exceeded")
```
Always add jitter to avoid thundering herd when many clients retry simultaneously.
Interview Design Problem: Rate Limiter
Requirements Gathering
Ask these questions:
- Scale: How many users? How many requests/sec globally?
- Granularity: Per user? Per IP? Per API key? Per endpoint?
- Rules: Single rule or tiered (free/pro/enterprise)?
- Accuracy: Strict (billing) or approximate (protection)?
- Latency: What overhead is acceptable?
- Distributed: Single DC or multi-region?
High-Level Design
```
                         +------------------+
                         |   Rules Engine   |
                         |  (config store)  |
                         +--------+---------+
                                  |
Client --> API Gateway --> Rate Limiter Middleware --> Service
                                  |
                         +--------+---------+
                         |  Redis Cluster   |
                         | (counter store)  |
                         +------------------+
```
Detailed Design Decisions
| Decision | Choice | Rationale |
|---|---|---|
| Algorithm | Sliding window counter | Best accuracy/memory trade-off |
| Storage | Redis Cluster | Sub-ms latency, atomic ops, HA |
| Rules storage | Config service or DB | Rules change infrequently |
| Key format | `rate:{user_id}:{endpoint}:{window}` | Flexible granularity |
| Failure mode | Allow (fail open) | Don't block users if Redis is down |
Handling Edge Cases
- Multi-region: Each region can have local Redis + periodic cross-region sync
- Redis failure: Fail open with local in-memory fallback (degrade gracefully)
- Clock skew: Use Redis server time (`TIME` command), not client time
- Hot keys: Shard by user ID; for extremely hot keys, use local caching
Real-World Examples
| Company | Approach | Details |
|---|---|---|
| GitHub API | Token bucket | 5,000 req/hr per authenticated user |
| Stripe | Token bucket + sliding window | Per-API-key with burst allowance |
| Cloudflare | Sliding window counter | Edge-based, per IP/route |
| Twitter/X | Fixed window | 15-min windows, per endpoint |
| Discord | Token bucket | Per route, per user, global |
Interview Tips
- Start with requirements: Don't jump to token bucket. Ask about accuracy needs, scale, and failure modes.
- Discuss trade-offs: No algorithm is universally best. Show you understand when each shines.
- Address distributed concerns: Single-node rate limiting is trivial. The hard part is consistency across nodes.
- Mention fail-open vs fail-closed: This shows operational maturity.
- Talk about multi-tenancy: Different rate limits for different tiers is a common follow-up.
Resources
- DDIA (Kleppmann) -- Chapter 8: The Trouble with Distributed Systems
- System Design Interview Vol. 1 (Alex Xu) -- Chapter 4: Design a Rate Limiter
- Stripe Engineering Blog: "Rate Limiters and Load Shedders"
- Cloudflare Blog: "How We Built Rate Limiting"
- IETF RFC 6585 -- HTTP 429 Status Code
- Google Cloud Architecture: "Rate-limiting strategies and techniques"
- Token Bucket on Wikipedia for algorithm formalization
Previous: 16 - Microservices Architecture | Next: 18 - Unique ID Generation