17 - Rate Limiting & Throttling

Previous: 16 - Microservices Architecture | Next: 18 - Unique ID Generation


Why Rate Limiting Matters

Rate limiting protects services from abuse, ensures fair resource allocation, and prevents cascading failures. Every production system at scale needs it.

Key motivations:

  • Prevent denial-of-service (intentional or accidental)
  • Enforce usage quotas and billing tiers
  • Protect downstream dependencies from overload
  • Ensure fair access across tenants in multi-tenant systems
  • Smooth out traffic spikes to maintain predictable latency

Core Algorithms

1. Token Bucket

The most widely deployed algorithm (used by AWS, Stripe, and most API gateways).

Bucket capacity: B tokens
Refill rate: R tokens/sec

On request:
  if tokens >= 1:
    tokens -= 1
    ALLOW
  else:
    REJECT (429)

  +-----------+
  |  Bucket   |  Capacity = 10
  |  [////  ] |  Current = 6 tokens
  +-----------+
       ^
       | +R tokens/sec (refill)

| Property | Value |
|---|---|
| Burst handling | Yes -- allows short bursts up to bucket capacity |
| Memory | O(1) per user -- just store count + last refill time |
| Precision | Good |
| Starvation | No -- tokens refill continuously |
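
The algorithm above can be sketched as a single-process Python class (class and method names are illustrative; the Redis + Lua version below handles the distributed case):

```python
import time

class TokenBucket:
    """In-process token bucket: capacity B tokens, refilled at R tokens/sec."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity  # start full, so bursts are allowed immediately
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.last_refill = now
        # Continuous refill, capped at bucket capacity
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(capacity=3, refill_rate=1.0)
print([bucket.allow() for _ in range(5)])  # burst of 3 allowed, then rejected
```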

Implementation sketch (Redis + Lua):

```lua
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

local data = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(data[1]) or capacity
local last_refill = tonumber(data[2]) or now

local elapsed = now - last_refill
tokens = math.min(capacity, tokens + elapsed * refill_rate)

if tokens >= 1 then
  tokens = tokens - 1
  redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
  redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) * 2)
  return 1 -- ALLOW
else
  redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
  return 0 -- REJECT
end
```

2. Leaky Bucket

Requests enter a FIFO queue and are processed at a fixed rate. Excess requests overflow.

  Incoming -->  [Queue (size B)]  --> Process at rate R
                     |
                 overflow (drop)

| Property | Value |
|---|---|
| Burst handling | No -- smooths all traffic to a constant rate |
| Memory | O(B) for the queue |
| Use case | Traffic shaping, smoothing bursty workloads |

Trade-off vs Token Bucket: Leaky bucket provides a perfectly smooth output rate but penalizes legitimate burst traffic. Token bucket is more forgiving.
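
A minimal in-process sketch of the queue-based variant (names are illustrative; a real deployment would call `drain` from a timer firing at rate R):

```python
from collections import deque

class LeakyBucket:
    """FIFO queue of bounded size B; requests beyond capacity overflow (drop)."""

    def __init__(self, queue_size: int):
        self.queue_size = queue_size
        self.queue = deque()

    def enqueue(self, request) -> bool:
        if len(self.queue) >= self.queue_size:
            return False  # overflow: drop the request
        self.queue.append(request)
        return True

    def drain(self):
        """Called at a fixed rate R (e.g. by a timer) to process one request."""
        return self.queue.popleft() if self.queue else None

bucket = LeakyBucket(queue_size=2)
print(bucket.enqueue("a"), bucket.enqueue("b"))  # True True
print(bucket.enqueue("c"))                       # False -- queue full
print(bucket.drain())                            # "a" -- processed in FIFO order
```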

3. Fixed Window Counter

Divide time into fixed windows (e.g., 1-minute intervals). Count requests per window.

  Window 1        Window 2        Window 3
  [12:00-12:01]   [12:01-12:02]   [12:02-12:03]
  Count: 97/100   Count: 45/100   Count: 0/100

| Property | Value |
|---|---|
| Memory | O(1) per user |
| Precision | Low -- boundary problem |
| Simplicity | Highest |

Boundary problem: A user can send 100 requests at 12:00:59 and 100 more at 12:01:01, getting 200 requests in 2 seconds despite a 100/min limit.
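
A short sketch (illustrative names; `now` is injected so the boundary problem is easy to demonstrate -- in production, old window keys would also need expiry):

```python
import time
from collections import defaultdict

class FixedWindowCounter:
    """Counts requests per fixed time window (e.g. per minute)."""

    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window = window_seconds
        self.counts = defaultdict(int)  # window index -> count

    def allow(self, now=None) -> bool:
        now = time.time() if now is None else now
        window_key = int(now // self.window)  # all of 12:00-12:01 maps to one key
        if self.counts[window_key] >= self.limit:
            return False
        self.counts[window_key] += 1
        return True

# The boundary problem in action: limit of 2/min, yet 4 requests
# succeed within ~1 second across the window boundary.
rl = FixedWindowCounter(limit=2, window_seconds=60)
print(rl.allow(now=59.5), rl.allow(now=59.8))  # True True
print(rl.allow(now=59.9))                      # False -- window full
print(rl.allow(now=60.1), rl.allow(now=60.5))  # True True -- new window
```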

4. Sliding Window Log

Store the timestamp of every request. Count requests within the trailing window.

Window = 60s, Limit = 100

Timestamps: [t-55, t-42, t-30, t-18, t-5, t-2, t]
                                                ^-- new request
Count timestamps where ts > (now - 60s)

| Property | Value |
|---|---|
| Memory | O(N) per user -- stores every timestamp |
| Precision | Exact |
| Cost | High memory at scale |
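
The evict-then-count logic can be sketched as follows (illustrative single-process version; Redis sorted sets play the role of the deque in a distributed setup):

```python
import time
from collections import deque

class SlidingWindowLog:
    """Exact limiter: stores the timestamp of every request in the trailing window."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.log = deque()  # timestamps, oldest first

    def allow(self, now=None) -> bool:
        now = time.time() if now is None else now
        # Evict timestamps that have fallen out of the trailing window
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        if len(self.log) >= self.limit:
            return False
        self.log.append(now)
        return True

rl = SlidingWindowLog(limit=2, window_seconds=60)
print(rl.allow(now=0), rl.allow(now=1))  # True True
print(rl.allow(now=2))                   # False -- 2 requests already in window
print(rl.allow(now=61))                  # True -- old timestamps expired
```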

5. Sliding Window Counter (Hybrid)

Combines fixed window counting with weighted interpolation. Best balance of accuracy and efficiency.

Previous window count: 84
Current window count:  25
Elapsed in current window: 30% --> previous window weight = 1 - 0.3 = 0.7

Weighted count = 84 * 0.7 + 25 = 83.8
83.8 < 100 --> ALLOW (reject once weighted count >= 100)

| Property | Value |
|---|---|
| Memory | O(1) per user -- two counters |
| Precision | Very good (not exact, but close) |
| Use case | Production systems needing accuracy + efficiency |
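
The interpolation above in code (a single-process sketch with illustrative names; counters older than one full window are treated as zero):

```python
class SlidingWindowCounter:
    """Two counters + weighted interpolation over the previous window."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.current_window = 0
        self.current_count = 0
        self.previous_count = 0

    def allow(self, now: float) -> bool:
        window_index = int(now // self.window)
        if window_index != self.current_window:
            # Roll over: the previous window only counts if it is adjacent
            self.previous_count = (
                self.current_count if window_index == self.current_window + 1 else 0
            )
            self.current_count = 0
            self.current_window = window_index
        elapsed_fraction = (now % self.window) / self.window
        weighted = self.previous_count * (1 - elapsed_fraction) + self.current_count
        if weighted >= self.limit:
            return False
        self.current_count += 1
        return True

# Reproducing the chapter's numbers: previous=84, current=25, 30% elapsed
rl = SlidingWindowCounter(limit=100, window_seconds=60)
rl.current_window, rl.previous_count, rl.current_count = 1, 84, 25
print(rl.allow(now=78.0))  # weighted = 84 * 0.7 + 25 = 83.8 < 100 -> True
```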

Algorithm Comparison

| Algorithm | Memory | Accuracy | Burst | Complexity | Production Use |
|---|---|---|---|---|---|
| Token Bucket | O(1) | Good | Yes | Low | AWS, Stripe, most gateways |
| Leaky Bucket | O(B) | Good | No | Medium | Traffic shaping |
| Fixed Window | O(1) | Low | Boundary bug | Lowest | Simple internal services |
| Sliding Log | O(N) | Exact | N/A | Medium | When precision is critical |
| Sliding Counter | O(1) | Very good | Partial | Low | Cloudflare, production APIs |

Distributed Rate Limiting

Single-node rate limiting breaks when you scale horizontally. Two main approaches:

Centralized (Redis-based)

  Client --> LB --> [Server 1] --+
                   [Server 2] --+--> Redis (shared state)
                   [Server 3] --+

Pros: Globally consistent, accurate counts. Cons: Redis becomes a dependency; adds ~1ms latency per request.

Redis implementation patterns:

  • INCR + EXPIRE for fixed window
  • Lua scripts for token bucket (atomic operations)
  • Sorted sets for sliding window log
  • Redis Cluster for HA

Local + Sync (Approximate)

Each node maintains local counters and periodically syncs with peers.

  [Server 1: local=30] --sync--> [Server 2: local=25]
        \                              /
         +---> Gossip / broadcast <---+

Pros: No single point of failure, low latency. Cons: Approximate -- can briefly exceed limits during sync gaps.

When to use which:

  • Centralized: Billing-critical limits, strict quotas, API monetization
  • Local + Sync: Best-effort protection, DDoS mitigation, internal services

Race Condition Handling

-- WRONG (race condition):
count = GET(key)
if count < limit:
  SET(key, count + 1)  -- Another thread may have incremented

-- RIGHT (atomic):
count = INCR(key)
if count == 1:
  EXPIRE(key, window_size)
if count > limit:
  REJECT

Always use atomic operations (Lua scripts or Redis transactions).
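
The same hazard exists in-process. A sketch of the increment-then-check pattern using a lock, which gives the same atomicity within one process that Redis INCR gives across processes (class name is illustrative; a production version would also reset the count each window):

```python
import threading

class AtomicCounter:
    """In-process analogue of the INCR pattern: increment, then check."""

    def __init__(self, limit: int):
        self.limit = limit
        self.count = 0
        self._lock = threading.Lock()

    def allow(self) -> bool:
        with self._lock:
            # Increment first, then compare -- no get/check/set gap to race on
            self.count += 1
            return self.count <= self.limit

limiter = AtomicCounter(limit=2)
print([limiter.allow() for _ in range(3)])  # [True, True, False]
```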


Rate Limiting at Different Layers

  Client  -->  CDN/Edge  -->  API Gateway  -->  Application  -->  Database
    |              |               |                |               |
  Client-side   Edge rate      Gateway rate     App-level       Connection
  throttling    limiting       limiting         throttling      pooling

| Layer | Tool | Granularity | Use Case |
|---|---|---|---|
| Client | Exponential backoff | Per client | Prevent self-DoS |
| CDN/Edge | Cloudflare, AWS WAF | IP, geo | DDoS, bot protection |
| API Gateway | Kong, Envoy, NGINX | API key, route | Quota enforcement |
| Application | Custom middleware | User, tenant, endpoint | Business rules |
| Database | Connection pool limits | Per service | Protect DB |

HTTP Headers and Status Codes

Standard Response Headers

```http
HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1672531260

{
  "error": "rate_limit_exceeded",
  "message": "Rate limit of 100 requests per minute exceeded.",
  "retry_after": 30
}
```

| Header / Status | Purpose |
|---|---|
| 429 Too Many Requests | HTTP status code for rate limiting |
| Retry-After | Seconds (or HTTP-date) until the client should retry |
| X-RateLimit-Limit | Maximum requests allowed in the window |
| X-RateLimit-Remaining | Requests remaining in the current window |
| X-RateLimit-Reset | Unix timestamp when the window resets |

Client-Side Best Practices

```python
import time
import random

def request_with_backoff(url, max_retries=5):
    for attempt in range(max_retries):
        response = http_get(url)  # http_get: your HTTP client of choice
        if response.status != 429:
            return response
        # Honor Retry-After, plus jitter to spread out retrying clients
        retry_after = int(response.headers.get('Retry-After', 1))
        jitter = random.uniform(0, 0.5)
        time.sleep(retry_after + jitter)
    raise Exception("Max retries exceeded")
```

Always add jitter to avoid thundering herd when many clients retry simultaneously.


Interview Design Problem: Rate Limiter

Requirements Gathering

Ask these questions:

  1. Scale: How many users? How many requests/sec globally?
  2. Granularity: Per user? Per IP? Per API key? Per endpoint?
  3. Rules: Single rule or tiered (free/pro/enterprise)?
  4. Accuracy: Strict (billing) or approximate (protection)?
  5. Latency: What overhead is acceptable?
  6. Distributed: Single DC or multi-region?

High-Level Design

                    +------------------+
                    |   Rules Engine   |
                    | (config store)   |
                    +--------+---------+
                             |
  Client --> API Gateway --> Rate Limiter Middleware --> Service
                             |
                    +--------+---------+
                    |   Redis Cluster  |
                    | (counter store)  |
                    +------------------+

Detailed Design Decisions

| Decision | Choice | Rationale |
|---|---|---|
| Algorithm | Sliding window counter | Best accuracy/memory trade-off |
| Storage | Redis Cluster | Sub-ms latency, atomic ops, HA |
| Rules storage | Config service or DB | Rules change infrequently |
| Key format | `rate:{user_id}:{endpoint}:{window}` | Flexible granularity |
| Failure mode | Allow (fail open) | Don't block users if Redis is down |

Handling Edge Cases

  • Multi-region: Each region can have local Redis + periodic cross-region sync
  • Redis failure: Fail open with local in-memory fallback (degrade gracefully)
  • Clock skew: Use Redis server time (TIME command), not client time
  • Hot keys: Shard by user ID; for extremely hot keys, use local caching

Real-World Examples

| Company | Approach | Details |
|---|---|---|
| GitHub API | Token bucket | 5,000 req/hr per authenticated user |
| Stripe | Token bucket + sliding window | Per-API-key with burst allowance |
| Cloudflare | Sliding window counter | Edge-based, per IP/route |
| Twitter/X | Fixed window | 15-min windows, per endpoint |
| Discord | Token bucket | Per route, per user, global |

Interview Tips

  1. Start with requirements: Don't jump to token bucket. Ask about accuracy needs, scale, and failure modes.
  2. Discuss trade-offs: No algorithm is universally best. Show you understand when each shines.
  3. Address distributed concerns: Single-node rate limiting is trivial. The hard part is consistency across nodes.
  4. Mention fail-open vs fail-closed: This shows operational maturity.
  5. Talk about multi-tenancy: Different rate limits for different tiers is a common follow-up.

Resources

  • DDIA (Kleppmann) -- Chapter 8: The Trouble with Distributed Systems
  • System Design Interview Vol. 1 (Alex Xu) -- Chapter 4: Design a Rate Limiter
  • Stripe Engineering Blog: "Rate Limiters and Load Shedders"
  • Cloudflare Blog: "How We Built Rate Limiting"
  • IETF RFC 6585 -- HTTP 429 Status Code
  • Google Cloud Architecture: "Rate-limiting strategies and techniques"
  • Token Bucket on Wikipedia for algorithm formalization
