17 - Rate Limiting & Throttling

Previous: 16 - Microservices Architecture | Next: 18 - Unique ID Generation


Why Rate Limiting Matters

Rate limiting protects services from abuse, ensures fair resource allocation, and prevents cascading failures. Every production system at scale needs it.

Key motivations:

  • Prevent denial-of-service (intentional or accidental)
  • Enforce usage quotas and billing tiers
  • Protect downstream dependencies from overload
  • Ensure fair access across tenants in multi-tenant systems
  • Smooth out traffic spikes to maintain predictable latency

Core Algorithms

1. Token Bucket

The most widely deployed algorithm (used by AWS, Stripe, and most API gateways).

Bucket capacity: B tokens
Refill rate: R tokens/sec

On request:
  if tokens >= 1:
    tokens -= 1
    ALLOW
  else:
    REJECT (429)

  +-----------+
  |  Bucket   |  Capacity = 10
  |  [////  ] |  Current = 6 tokens
  +-----------+
       ^
       | +R tokens/sec (refill)

| Property | Value |
|---|---|
| Burst handling | Yes -- allows short bursts up to bucket capacity |
| Memory | O(1) per user -- just store count + last refill time |
| Precision | Good |
| Starvation | No -- tokens refill continuously |
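
The algorithm above can be sketched as a single-process Python class (class and method names are illustrative; the Redis + Lua version below handles the distributed case):

```python
import time

class TokenBucket:
    """In-process token bucket: capacity B tokens, refilled at R tokens/sec."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity  # start full, so bursts are allowed immediately
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.last_refill = now
        # Continuous refill, capped at bucket capacity
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(capacity=3, refill_rate=1.0)
print([bucket.allow() for _ in range(5)])  # burst of 3 allowed, then rejected
```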

Implementation sketch (Redis + Lua):

```lua
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

local data = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(data[1]) or capacity
local last_refill = tonumber(data[2]) or now

local elapsed = now - last_refill
tokens = math.min(capacity, tokens + elapsed * refill_rate)

if tokens >= 1 then
  tokens = tokens - 1
  redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
  redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) * 2)
  return 1 -- ALLOW
else
  redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
  return 0 -- REJECT
end
```

2. Leaky Bucket

Requests enter a FIFO queue and are processed at a fixed rate. Excess requests overflow.

  Incoming -->  [Queue (size B)]  --> Process at rate R
                     |
                 overflow (drop)

| Property | Value |
|---|---|
| Burst handling | No -- smooths all traffic to a constant rate |
| Memory | O(B) for the queue |
| Use case | Traffic shaping, smoothing bursty workloads |

Trade-off vs Token Bucket: Leaky bucket provides a perfectly smooth output rate but penalizes legitimate burst traffic. Token bucket is more forgiving.
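
A minimal in-process sketch of the queue-based variant (names are illustrative; a real deployment would call `drain` from a timer firing at rate R):

```python
from collections import deque

class LeakyBucket:
    """FIFO queue of bounded size B; requests beyond capacity overflow (drop)."""

    def __init__(self, queue_size: int):
        self.queue_size = queue_size
        self.queue = deque()

    def enqueue(self, request) -> bool:
        if len(self.queue) >= self.queue_size:
            return False  # overflow: drop the request
        self.queue.append(request)
        return True

    def drain(self):
        """Called at a fixed rate R (e.g. by a timer) to process one request."""
        return self.queue.popleft() if self.queue else None

bucket = LeakyBucket(queue_size=2)
print(bucket.enqueue("a"), bucket.enqueue("b"))  # True True
print(bucket.enqueue("c"))                       # False -- queue full
print(bucket.drain())                            # "a" -- processed in FIFO order
```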

3. Fixed Window Counter

Divide time into fixed windows (e.g., 1-minute intervals). Count requests per window.

  Window 1        Window 2        Window 3
  [12:00-12:01]   [12:01-12:02]   [12:02-12:03]
  Count: 97/100   Count: 45/100   Count: 0/100

| Property | Value |
|---|---|
| Memory | O(1) per user |
| Precision | Low -- boundary problem |
| Simplicity | Highest |

Boundary problem: A user can send 100 requests at 12:00:59 and 100 more at 12:01:01, getting 200 requests in 2 seconds despite a 100/min limit.
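
A short sketch (illustrative names; `now` is injected so the boundary problem is easy to demonstrate -- in production, old window keys would also need expiry):

```python
import time
from collections import defaultdict

class FixedWindowCounter:
    """Counts requests per fixed time window (e.g. per minute)."""

    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window = window_seconds
        self.counts = defaultdict(int)  # window index -> count

    def allow(self, now=None) -> bool:
        now = time.time() if now is None else now
        window_key = int(now // self.window)  # all of 12:00-12:01 maps to one key
        if self.counts[window_key] >= self.limit:
            return False
        self.counts[window_key] += 1
        return True

# The boundary problem in action: limit of 2/min, yet 4 requests
# succeed within ~1 second across the window boundary.
rl = FixedWindowCounter(limit=2, window_seconds=60)
print(rl.allow(now=59.5), rl.allow(now=59.8))  # True True
print(rl.allow(now=59.9))                      # False -- window full
print(rl.allow(now=60.1), rl.allow(now=60.5))  # True True -- new window
```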

4. Sliding Window Log

Store the timestamp of every request. Count requests within the trailing window.

Window = 60s, Limit = 100

Timestamps: [t-55, t-42, t-30, t-18, t-5, t-2, t]
                                                ^-- new request
Count timestamps where ts > (now - 60s)

| Property | Value |
|---|---|
| Memory | O(N) per user -- stores every timestamp |
| Precision | Exact |
| Cost | High memory at scale |
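
The evict-then-count logic can be sketched as follows (illustrative single-process version; Redis sorted sets play the role of the deque in a distributed setup):

```python
import time
from collections import deque

class SlidingWindowLog:
    """Exact limiter: stores the timestamp of every request in the trailing window."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.log = deque()  # timestamps, oldest first

    def allow(self, now=None) -> bool:
        now = time.time() if now is None else now
        # Evict timestamps that have fallen out of the trailing window
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        if len(self.log) >= self.limit:
            return False
        self.log.append(now)
        return True

rl = SlidingWindowLog(limit=2, window_seconds=60)
print(rl.allow(now=0), rl.allow(now=1))  # True True
print(rl.allow(now=2))                   # False -- 2 requests already in window
print(rl.allow(now=61))                  # True -- old timestamps expired
```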

5. Sliding Window Counter (Hybrid)

Combines fixed window counting with weighted interpolation. Best balance of accuracy and efficiency.

Previous window count: 84
Current window count:  25
Elapsed in current window: 30% --> previous window weight = 1 - 0.3 = 0.7

Weighted count = 84 * 0.7 + 25 = 83.8
83.8 < 100 --> ALLOW (reject once weighted count >= 100)

| Property | Value |
|---|---|
| Memory | O(1) per user -- two counters |
| Precision | Very good (not exact, but close) |
| Use case | Production systems needing accuracy + efficiency |
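
The interpolation above in code (a single-process sketch with illustrative names; counters older than one full window are treated as zero):

```python
class SlidingWindowCounter:
    """Two counters + weighted interpolation over the previous window."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.current_window = 0
        self.current_count = 0
        self.previous_count = 0

    def allow(self, now: float) -> bool:
        window_index = int(now // self.window)
        if window_index != self.current_window:
            # Roll over: the previous window only counts if it is adjacent
            self.previous_count = (
                self.current_count if window_index == self.current_window + 1 else 0
            )
            self.current_count = 0
            self.current_window = window_index
        elapsed_fraction = (now % self.window) / self.window
        weighted = self.previous_count * (1 - elapsed_fraction) + self.current_count
        if weighted >= self.limit:
            return False
        self.current_count += 1
        return True

# Reproducing the chapter's numbers: previous=84, current=25, 30% elapsed
rl = SlidingWindowCounter(limit=100, window_seconds=60)
rl.current_window, rl.previous_count, rl.current_count = 1, 84, 25
print(rl.allow(now=78.0))  # weighted = 84 * 0.7 + 25 = 83.8 < 100 -> True
```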

Algorithm Comparison

| Algorithm | Memory | Accuracy | Burst | Complexity | Production Use |
|---|---|---|---|---|---|
| Token Bucket | O(1) | Good | Yes | Low | AWS, Stripe, most gateways |
| Leaky Bucket | O(B) | Good | No | Medium | Traffic shaping |
| Fixed Window | O(1) | Low | Boundary bug | Lowest | Simple internal services |
| Sliding Log | O(N) | Exact | N/A | Medium | When precision is critical |
| Sliding Counter | O(1) | Very good | Partial | Low | Cloudflare, production APIs |

Distributed Rate Limiting

Single-node rate limiting breaks when you scale horizontally. Two main approaches:

Centralized (Redis-based)

  Client --> LB --> [Server 1] --+
                   [Server 2] --+--> Redis (shared state)
                   [Server 3] --+

Pros: Globally consistent, accurate counts. Cons: Redis becomes a dependency; adds ~1ms latency per request.

Redis implementation patterns:

  • INCR + EXPIRE for fixed window
  • Lua scripts for token bucket (atomic operations)
  • Sorted sets for sliding window log
  • Redis Cluster for HA

Local + Sync (Approximate)

Each node maintains local counters and periodically syncs with peers.

  [Server 1: local=30] --sync--> [Server 2: local=25]
        \                              /
         +---> Gossip / broadcast <---+

Pros: No single point of failure, low latency. Cons: Approximate -- can briefly exceed limits during sync gaps.

When to use which:

  • Centralized: Billing-critical limits, strict quotas, API monetization
  • Local + Sync: Best-effort protection, DDoS mitigation, internal services

Race Condition Handling

-- WRONG (race condition):
count = GET(key)
if count < limit:
  SET(key, count + 1)  -- Another thread may have incremented

-- RIGHT (atomic):
count = INCR(key)
if count == 1:
  EXPIRE(key, window_size)
if count > limit:
  REJECT

Always use atomic operations (Lua scripts or Redis transactions).
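
The same hazard exists in-process. A sketch of the increment-then-check pattern using a lock, which gives the same atomicity within one process that Redis INCR gives across processes (class name is illustrative; a production version would also reset the count each window):

```python
import threading

class AtomicCounter:
    """In-process analogue of the INCR pattern: increment, then check."""

    def __init__(self, limit: int):
        self.limit = limit
        self.count = 0
        self._lock = threading.Lock()

    def allow(self) -> bool:
        with self._lock:
            # Increment first, then compare -- no get/check/set gap to race on
            self.count += 1
            return self.count <= self.limit

limiter = AtomicCounter(limit=2)
print([limiter.allow() for _ in range(3)])  # [True, True, False]
```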


Rate Limiting at Different Layers

  Client  -->  CDN/Edge  -->  API Gateway  -->  Application  -->  Database
    |              |               |                |               |
  Client-side   Edge rate      Gateway rate     App-level       Connection
  throttling    limiting       limiting         throttling      pooling

| Layer | Tool | Granularity | Use Case |
|---|---|---|---|
| Client | Exponential backoff | Per client | Prevent self-DoS |
| CDN/Edge | Cloudflare, AWS WAF | IP, geo | DDoS, bot protection |
| API Gateway | Kong, Envoy, NGINX | API key, route | Quota enforcement |
| Application | Custom middleware | User, tenant, endpoint | Business rules |
| Database | Connection pool limits | Per service | Protect DB |

HTTP Headers and Status Codes

Standard Response Headers

```http
HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1672531260

{
  "error": "rate_limit_exceeded",
  "message": "Rate limit of 100 requests per minute exceeded.",
  "retry_after": 30
}
```

| Header / Status | Purpose |
|---|---|
| 429 Too Many Requests | HTTP status code for rate limiting |
| Retry-After | Seconds (or HTTP-date) until the client should retry |
| X-RateLimit-Limit | Maximum requests allowed in the window |
| X-RateLimit-Remaining | Requests remaining in the current window |
| X-RateLimit-Reset | Unix timestamp when the window resets |

Client-Side Best Practices

```python
import time
import random

def request_with_backoff(url, max_retries=5):
    for attempt in range(max_retries):
        response = http_get(url)  # http_get: your HTTP client of choice
        if response.status != 429:
            return response
        # Honor Retry-After, plus jitter to spread out retrying clients
        retry_after = int(response.headers.get('Retry-After', 1))
        jitter = random.uniform(0, 0.5)
        time.sleep(retry_after + jitter)
    raise Exception("Max retries exceeded")
```

Always add jitter to avoid thundering herd when many clients retry simultaneously.


Interview Design Problem: Rate Limiter

Requirements Gathering

Ask these questions:

  1. Scale: How many users? How many requests/sec globally?
  2. Granularity: Per user? Per IP? Per API key? Per endpoint?
  3. Rules: Single rule or tiered (free/pro/enterprise)?
  4. Accuracy: Strict (billing) or approximate (protection)?
  5. Latency: What overhead is acceptable?
  6. Distributed: Single DC or multi-region?

High-Level Design

                    +------------------+
                    |   Rules Engine   |
                    | (config store)   |
                    +--------+---------+
                             |
  Client --> API Gateway --> Rate Limiter Middleware --> Service
                             |
                    +--------+---------+
                    |   Redis Cluster  |
                    | (counter store)  |
                    +------------------+

Detailed Design Decisions

| Decision | Choice | Rationale |
|---|---|---|
| Algorithm | Sliding window counter | Best accuracy/memory trade-off |
| Storage | Redis Cluster | Sub-ms latency, atomic ops, HA |
| Rules storage | Config service or DB | Rules change infrequently |
| Key format | `rate:{user_id}:{endpoint}:{window}` | Flexible granularity |
| Failure mode | Allow (fail open) | Don't block users if Redis is down |

Handling Edge Cases

  • Multi-region: Each region can have local Redis + periodic cross-region sync
  • Redis failure: Fail open with local in-memory fallback (degrade gracefully)
  • Clock skew: Use Redis server time (TIME command), not client time
  • Hot keys: Shard by user ID; for extremely hot keys, use local caching

Real-World Examples

| Company | Approach | Details |
|---|---|---|
| GitHub API | Token bucket | 5,000 req/hr per authenticated user |
| Stripe | Token bucket + sliding window | Per-API-key with burst allowance |
| Cloudflare | Sliding window counter | Edge-based, per IP/route |
| Twitter/X | Fixed window | 15-min windows, per endpoint |
| Discord | Token bucket | Per route, per user, global |

Interview Tips

  1. Start with requirements: Don't jump to token bucket. Ask about accuracy needs, scale, and failure modes.
  2. Discuss trade-offs: No algorithm is universally best. Show you understand when each shines.
  3. Address distributed concerns: Single-node rate limiting is trivial. The hard part is consistency across nodes.
  4. Mention fail-open vs fail-closed: This shows operational maturity.
  5. Talk about multi-tenancy: Different rate limits for different tiers is a common follow-up.

Resources

  • DDIA (Kleppmann) -- Chapter 8: The Trouble with Distributed Systems
  • System Design Interview Vol. 1 (Alex Xu) -- Chapter 4: Design a Rate Limiter
  • Stripe Engineering Blog: "Rate Limiters and Load Shedders"
  • Cloudflare Blog: "How We Built Rate Limiting"
  • IETF RFC 6585 -- HTTP 429 Status Code
  • Google Cloud Architecture: "Rate-limiting strategies and techniques"
  • Token Bucket on Wikipedia for algorithm formalization
