28 - Design Rate Limiter
Previous: 27 - Design URL Shortener | Next: 29 - Design Key-Value Store
Why This Matters in Interviews
Rate limiting protects systems from abuse, prevents cascading failures, and enforces fair usage. It appears in nearly every system design interview -- either as the primary question or as a component you should mention proactively.
Step 1: Requirements
Functional Requirements
- Limit number of requests a client can send within a time window
- Support multiple limiting rules (per user, per IP, per API key, per endpoint)
- Return informative response when rate limited
- Work in a distributed environment (multiple servers)
Non-Functional Requirements
| Requirement | Target |
|---|---|
| Latency overhead | < 1ms per request |
| Availability | Must not become a single point of failure |
| Accuracy | Acceptable to be slightly over-limit (not under) |
| Scale | Handle millions of QPS across the fleet |
| Fault tolerance | If rate limiter fails, traffic should pass (fail-open) |
Step 2: Where to Place It
Option 1: Client-side
[Client] ---(rate limit locally)---> [Server]
Problem: Clients can bypass it. Not trustworthy.
Option 2: Server-side (application layer)
[Client] ---> [Server + Rate Limiter] ---> [Business Logic]
Problem: Each server counts independently.
Option 3: Middleware / API Gateway (Recommended)
[Client] ---> [API Gateway / Rate Limiter] ---> [Server]
Benefits: Centralized, consistent, offloads work from app servers.
Option 4: Service mesh sidecar
[Client] ---> [Envoy Proxy + Rate Limit] ---> [Server]
Benefits: Transparent to application code.
Decision Matrix
| Placement | Pros | Cons |
|---|---|---|
| Client | Reduces unnecessary calls | Untrusted, easily bypassed |
| Server | Access to user context | Inconsistent across instances |
| API Gateway | Centralized, consistent | Potential bottleneck |
| Service Mesh | Transparent, per-service | Infrastructure complexity |
Interview recommendation: API Gateway for external APIs, sidecar/middleware for internal service-to-service limiting.
Step 3: Rate Limiting Algorithms
Token Bucket
```
Bucket capacity: 4 tokens
Refill rate:     2 tokens/second

Time 0s: [****]  (4 tokens, full)
Request: [***-]  (consume 1 token, 3 left)
Request: [**--]  (consume 1 token, 2 left)
Request: [*---]  (consume 1 token, 1 left)
Request: [----]  (consume 1 token, 0 left)
Request: REJECTED (no tokens)
Time 1s: [**--]  (2 tokens refilled)
Request: [*---]  (consume 1, 1 left)
```
| Property | Value |
|---|---|
| Parameters | Bucket size (max burst), refill rate (sustained rate) |
| Burst allowed? | Yes (up to bucket size) |
| Memory | O(1) per user (counter + timestamp) |
| Used by | AWS API Gateway, Stripe |
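The lazy-refill trick above can be sketched as a process-local Python class (names are illustrative; the distributed version appears in Step 4):

```python
import time

class TokenBucket:
    """Token bucket: refill lazily on each check, consume one token per request."""

    def __init__(self, capacity, refill_rate, now=None):
        self.capacity = capacity        # max burst size
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = float(capacity)   # start full
        self.last_refill = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Refill based on elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Replaying the trace above (capacity 4, refill 2/s): four requests drain the bucket, the fifth is rejected, and one second later two tokens are back.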
Leaky Bucket
Bucket processes requests at fixed rate, regardless of burst:
```
Incoming: |||||||| (burst of 8 requests)
Queue:    [||||||||] (queued)
Output:   |..|..|.. (processed at fixed rate: 1 per interval)

If queue is full --> REJECT
```
| Property | Value |
|---|---|
| Parameters | Bucket size (queue length), outflow rate |
| Burst allowed? | No (smooths traffic to constant rate) |
| Memory | O(1) per user |
| Used by | Networking (packet shaping) |
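The queue-and-drain behavior can be sketched the same way (a process-local illustration; here `allow` means "accepted into the queue", since actual processing happens later at the fixed outflow rate):

```python
from collections import deque

class LeakyBucket:
    """Leaky bucket: bounded queue drained at a fixed outflow rate."""

    def __init__(self, queue_size, outflow_per_sec):
        self.queue = deque()
        self.queue_size = queue_size
        self.outflow_per_sec = outflow_per_sec
        self.last_leak = 0.0

    def allow(self, now):
        # Drain whatever would have been processed since the last check
        leaked = int((now - self.last_leak) * self.outflow_per_sec)
        if leaked > 0:
            for _ in range(min(leaked, len(self.queue))):
                self.queue.popleft()
            self.last_leak = now
        if len(self.queue) < self.queue_size:
            self.queue.append(now)
            return True   # queued; will be processed at the outflow rate
        return False      # bucket full --> reject
```

With a queue of 3 and an outflow of 1/s, a burst of 4 at t=0 drops the last request; by t=2 two slots have drained and admission resumes.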
Fixed Window Counter
```
Window: 1 minute, limit: 100 requests

Minute 1 [00:00-01:00]: count = 87 --> ALLOW
Minute 2 [01:00-02:00]: count = 0  --> starts fresh
```
Problem - boundary burst:
```
[00:30-01:00]: 100 requests (allowed)
[01:00-01:30]: 100 requests (allowed, new window)
Result: 200 requests in 1 minute (2x limit!)
```
| Property | Value |
|---|---|
| Parameters | Window size, max requests per window |
| Burst at boundary? | Yes (major drawback) |
| Memory | O(1) per user |
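A sketch makes the boundary flaw easy to demonstrate (illustrative Python):

```python
class FixedWindow:
    """Fixed window counter: reset the count whenever the window index changes."""

    def __init__(self, window_size, limit):
        self.window_size = window_size
        self.limit = limit
        self.window = None   # index of the current window
        self.count = 0

    def allow(self, now):
        window = int(now // self.window_size)
        if window != self.window:
            self.window = window
            self.count = 0   # new window starts fresh
        if self.count < self.limit:
            self.count += 1
            return True
        return False
```

Hammering the limiter at 00:59 and again at 01:00 shows the 2x burst: both windows accept their full quota within one second of wall-clock time.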
Sliding Window Log
Track timestamp of each request in a sorted set:
```
Requests at: 00:05, 00:10, 00:15, 01:10, 01:15, 01:20

Window = 1 min, checking at 01:20:
Keep only (00:20, 01:20] (last 60 seconds)
In window: 01:10, 01:15, 01:20 = 3 requests
```
| Property | Value |
|---|---|
| Parameters | Window size, max requests |
| Burst at boundary? | No (precise) |
| Memory | O(N) per user (stores all timestamps) |
| Drawback | High memory for high-rate users |
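A sorted structure per user makes the log exact but O(N) in memory; a Python sketch using a deque (timestamps arrive in order, so evicting from the left suffices):

```python
from collections import deque

class SlidingWindowLog:
    """Keep one timestamp per request; count only those inside the window."""

    def __init__(self, window_size, limit):
        self.window_size = window_size
        self.limit = limit
        self.log = deque()   # timestamps, oldest first

    def allow(self, now):
        # Evict timestamps that fell out of the window
        while self.log and self.log[0] <= now - self.window_size:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False
```

Replaying the example above (times in seconds): three early requests are evicted by the time the later three arrive, so all six pass, but a fourth request inside the final 60 seconds is rejected.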
Sliding Window Counter (Best Trade-off)
Combines fixed window + weighted previous window:
```
Previous window (00:00-01:00): 84 requests
Current window  (01:00-02:00): 36 requests (so far)
Current position in window: 25% through

Weighted count = 84 * (1 - 0.25) + 36 = 63 + 36 = 99
Limit: 100 --> ALLOW (just barely)
```
| Property | Value |
|---|---|
| Parameters | Window size, max requests |
| Burst at boundary? | Minimal (smoothed) |
| Memory | O(1) per user (two counters + timestamp) |
| Accuracy | Approximate but very close |
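The weighted-count formula translates directly into code (illustrative Python; it assumes the stored window is exactly one behind the current one, otherwise its count is stale and dropped):

```python
class SlidingWindowCounter:
    """Two counters per user; weight the previous window by its remaining overlap."""

    def __init__(self, window_size, limit):
        self.window_size = window_size
        self.limit = limit
        self.window = None   # index of the current window
        self.current = 0
        self.previous = 0

    def allow(self, now):
        window = int(now // self.window_size)
        if window != self.window:
            # Roll counters forward; anything older than one window is dropped
            if self.window is not None and window == self.window + 1:
                self.previous = self.current
            else:
                self.previous = 0
            self.window = window
            self.current = 0
        position = (now % self.window_size) / self.window_size
        weighted = self.previous * (1 - position) + self.current
        if weighted < self.limit:
            self.current += 1
            return True
        return False
```

Replaying the numbers above: 84 requests land in the previous window; 25% into the next window the weighted count is 84 × 0.75 + current, so exactly 37 more requests pass before the limit of 100 is reached.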
Algorithm Comparison
| Algorithm | Memory | Accuracy | Burst Handling | Complexity |
|---|---|---|---|---|
| Token Bucket | O(1) | Good | Allows controlled burst | Low |
| Leaky Bucket | O(1) | Good | No burst (smooths) | Low |
| Fixed Window | O(1) | Poor (boundary) | Allows 2x burst | Low |
| Sliding Log | O(N) | Exact | No boundary issue | Medium |
| Sliding Window Counter | O(1) | Very good | Minimal boundary issue | Low |
Interview recommendation: Token Bucket for most cases (allows bursts, simple, low memory). Sliding Window Counter when boundary accuracy matters.
Step 4: Distributed Rate Limiting with Redis
Why Redis?
- Sub-millisecond latency
- Atomic operations (INCR, EXPIRE)
- Built-in TTL for key expiry
- Cluster mode for horizontal scaling
Token Bucket with Redis + Lua
```lua
-- Lua script (atomic in Redis)
local key = KEYS[1]
local capacity = tonumber(ARGV[1])     -- max tokens
local refill_rate = tonumber(ARGV[2])  -- tokens per second
local now = tonumber(ARGV[3])          -- current timestamp

local bucket = redis.call('hmget', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now

-- Refill tokens based on elapsed time
local elapsed = now - last_refill
local new_tokens = math.min(capacity, tokens + elapsed * refill_rate)

if new_tokens >= 1 then
  new_tokens = new_tokens - 1
  redis.call('hmset', key, 'tokens', new_tokens, 'last_refill', now)
  -- EXPIRE takes an integer number of seconds
  redis.call('expire', key, math.ceil(capacity / refill_rate * 2))
  return 1 -- ALLOWED
else
  redis.call('hmset', key, 'tokens', new_tokens, 'last_refill', now)
  redis.call('expire', key, math.ceil(capacity / refill_rate * 2))
  return 0 -- REJECTED
end
```
Why Lua? The entire script executes atomically in Redis -- no race conditions between read and write, even with concurrent requests.
Sliding Window Counter with Redis
```lua
-- Lua script for sliding window counter
local key = KEYS[1]
local window_size = tonumber(ARGV[1])  -- e.g., 60 seconds
local limit = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

local current_window = math.floor(now / window_size)
local previous_window = current_window - 1
local position_in_window = (now % window_size) / window_size

local current_key = key .. ":" .. current_window
local previous_key = key .. ":" .. previous_window

local current_count = tonumber(redis.call('get', current_key) or "0")
local previous_count = tonumber(redis.call('get', previous_key) or "0")
local weighted = previous_count * (1 - position_in_window) + current_count

if weighted < limit then
  redis.call('incr', current_key)
  redis.call('expire', current_key, window_size * 2)
  return 1 -- ALLOWED
else
  return 0 -- REJECTED
end
```
Step 5: System Architecture
```
                +------------------+
                |  Load Balancer   |
                +--------+---------+
                         |
      +------------------+------------------+
      |                  |                  |
+-----+--------+  +------+-------+  +------+-------+
| API Gateway  |  | API Gateway  |  | API Gateway  |
| + Rate Limit |  | + Rate Limit |  | + Rate Limit |
|  Middleware  |  |  Middleware  |  |  Middleware  |
+-----+--------+  +------+-------+  +------+-------+
      |                  |                  |
      +---------+--------+--------+---------+
                |                 |
       +--------+-------+  +------+--------+
       | Redis Cluster  |  | Rules Engine  |
       | (counters,     |  | (rate limit   |
       |  tokens)       |  |  configs)     |
       +----------------+  +---------------+
                |
       +--------+---------+
       |                  |
+------+-------+   +------+-------+
| App Server 1 |   | App Server N |
+--------------+   +--------------+
```
Step 6: Rate Limiter Rules Engine
Rules define who, what, and how much:
```yaml
rules:
  - name: "Global API limit"
    key: "global"
    limit: 1000000
    window: "1m"

  - name: "Per-user API limit"
    key: "user:{user_id}"
    limit: 1000
    window: "1h"

  - name: "Per-IP login limit"
    key: "ip:{ip}:login"
    limit: 5
    window: "15m"

  - name: "Per-user expensive endpoint"
    key: "user:{user_id}:search"
    limit: 30
    window: "1m"
```
Evaluation order: Most specific rule wins. Check per-endpoint, then per-user, then per-IP, then global.
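One way to sketch rule matching (illustrative Python; real engines also match on endpoint and HTTP method, which is omitted here, and the key templates mirror the YAML above):

```python
RULES = [  # ordered most specific --> least specific
    {"name": "Per-user expensive endpoint", "key": "user:{user_id}:search",
     "limit": 30, "window": 60},
    {"name": "Per-IP login limit", "key": "ip:{ip}:login",
     "limit": 5, "window": 900},
    {"name": "Per-user API limit", "key": "user:{user_id}",
     "limit": 1000, "window": 3600},
    {"name": "Global API limit", "key": "global",
     "limit": 1000000, "window": 60},
]

def applicable_rules(request):
    """Expand each rule's key template; a rule applies if all its placeholders resolve."""
    matched = []
    for rule in RULES:
        try:
            key = rule["key"].format(**request)
        except KeyError:
            continue  # request lacks a field this rule needs
        matched.append((key, rule["limit"], rule["window"]))
    return matched
```

An authenticated request resolves the per-user rules plus the global one; an anonymous request resolves only the IP-based and global rules.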
Step 7: Handling Race Conditions
The Lost Update Problem
Without atomicity:
```
Server A: READ  count = 99   (limit = 100)
Server B: READ  count = 99
Server A: WRITE count = 100  (allowed)
Server B: WRITE count = 100  (also allowed!)

Result: 101 requests processed (over limit)
```
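The interleaving above can be replayed deterministically, alongside the fix (single-process Python stands in for two servers; Redis Lua scripts provide the same check-and-increment atomicity across servers):

```python
LIMIT = 100
store = {"count": 99}

# Non-atomic read-modify-write: both "servers" read before either writes.
read_a = store["count"]        # Server A reads 99
read_b = store["count"]        # Server B reads 99, before A writes
allowed_a = read_a < LIMIT     # A: 99 < 100 --> allow
allowed_b = read_b < LIMIT     # B: 99 < 100 --> allow (the lost update)
store["count"] = read_a + 1    # A writes 100
store["count"] = read_b + 1    # B clobbers A's write with 100

# Atomic version: check and increment in one critical section.
def atomic_allow(state, limit):
    if state["count"] < limit:
        state["count"] += 1
        return True
    return False
```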
Solutions
| Approach | How | Trade-off |
|---|---|---|
| Redis Lua scripts | Atomic read-modify-write | Requires Redis, adds network hop |
| Redis MULTI/EXEC | Transaction block | Less flexible than Lua |
| Local + sync | Local counter synced periodically | Slightly over-limit during sync gaps |
| Sorted sets | ZRANGEBYSCORE + ZCARD atomic | Higher memory for sliding window |
Step 8: Rate Limiting Dimensions
| Dimension | Key Pattern | Use Case |
|---|---|---|
| User ID | user:12345 | Authenticated API usage |
| IP Address | ip:1.2.3.4 | Anonymous/unauthenticated |
| API Key | key:abc123 | Third-party integrations |
| Endpoint | user:12345:/api/search | Expensive operations |
| Tenant | tenant:acme | Multi-tenant SaaS |
| Global | global | Overall system protection |
Layered limiting: Apply multiple limits simultaneously. A request must pass ALL applicable limits.
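The "pass ALL limits" rule is just a conjunction over per-dimension limiters (a toy sketch; note that a naive check consumes quota from layers evaluated before a rejection, which production limiters handle by refunding or by checking before consuming):

```python
class CounterLimit:
    """Minimal counter limit, just enough to show layering."""

    def __init__(self, limit):
        self.limit = limit
        self.count = 0

    def allow(self):
        if self.count < self.limit:
            self.count += 1
            return True
        return False

def check_layers(limiters):
    # Evaluate every layer; the request passes only if all of them allow it.
    verdicts = [lim.allow() for lim in limiters]
    return all(verdicts)
```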
Step 9: HTTP Response Design
When Allowed
```
HTTP/1.1 200 OK
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 742
X-RateLimit-Reset: 1710500400
```
When Rate Limited
```
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1710500400
Retry-After: 30
Content-Type: application/json

{
  "error": "rate_limit_exceeded",
  "message": "Rate limit exceeded. Try again in 30 seconds.",
  "retry_after": 30
}
```
Key headers:
| Header | Purpose |
|---|---|
| X-RateLimit-Limit | Total allowed requests in window |
| X-RateLimit-Remaining | Requests remaining in current window |
| X-RateLimit-Reset | Unix timestamp when the window resets |
| Retry-After | Seconds until the client should retry |
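Assembling those headers is mechanical (illustrative Python; `reset_ts` and `now` are Unix timestamps in seconds):

```python
def rate_limit_headers(limit, remaining, reset_ts, now):
    """Build response headers for both the allowed and the 429 case."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_ts),
    }
    if remaining <= 0:
        # Only rejected responses need to tell the client when to retry
        headers["Retry-After"] = str(max(0, reset_ts - now))
    return headers
```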
Step 10: Monitoring
Dashboard:
```
+-----------------------------------+-----------------------------------+
| Rate Limited Requests/sec         | Top Rate-Limited Users            |
| [line chart by endpoint]          | user:12345 - 4,523 rejections     |
|                                   | user:67890 - 2,100 rejections     |
+-----------------------------------+-----------------------------------+
| Rejection Rate by Rule            | Redis Latency (p50/p99)           |
| Login:  12.3%                     | p50: 0.2ms   p99: 0.8ms           |
| Search:  3.1%                     | [line chart]                      |
| Global:  0.01%                    |                                   |
+-----------------------------------+-----------------------------------+
```
Key metrics to track:
- Total requests allowed vs rejected (by rule, by dimension)
- Redis latency for rate limit checks
- False positive rate (legitimate users being limited)
- Rule hit distribution (which rules trigger most?)
Step 11: Performance Considerations
| Concern | Solution |
|---|---|
| Redis latency | Use Redis cluster, co-locate with app servers |
| Redis failure | Fail-open (allow all traffic if Redis is down) |
| Hot keys | Shard by user/IP hash across Redis nodes |
| Lua script CPU | Keep scripts simple, avoid loops |
| Network overhead | Batch multiple limit checks in one roundtrip |
| Local caching | Cache "definitely under limit" locally for short TTL |
Hybrid Approach (Local + Centralized)
```
Request --> [Local counter (fast, approximate)]
                     |
    Under local threshold? --> Allow immediately
                     |
    Near threshold? --> [Check Redis (precise)]
                     |
    Over limit? --> Reject
```
This can reduce Redis calls by 80%+ for well-behaved clients.
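A sketch of that flow (illustrative Python; `central_allow` stands in for the precise Redis check, and the threshold would be tuned per node):

```python
class HybridLimiter:
    """Local approximate counter; consult the central store only near the limit."""

    def __init__(self, local_threshold, central_allow):
        self.local_threshold = local_threshold  # e.g. 80% of this node's share
        self.local_count = 0
        self.central_allow = central_allow      # callable: precise check (e.g. Redis)
        self.central_calls = 0

    def allow(self):
        self.local_count += 1
        if self.local_count <= self.local_threshold:
            return True                # comfortably under: no network hop
        self.central_calls += 1
        return self.central_allow()    # near the limit: ask the source of truth
```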
Edge Cases
| Scenario | Handling |
|---|---|
| Clock skew across servers | Use Redis server time (consistent) |
| Rate limiter becomes bottleneck | Fail-open, alert, auto-scale Redis |
| Distributed client (many IPs) | Rate limit by user ID, not just IP |
| Legitimate burst (product launch) | Temporary rule override, higher limits |
| VPN/proxy users sharing IP | Combine IP + user-agent + fingerprint |
Interview Walkthrough Checklist
[ ] Clarify requirements (which dimension, how strict, distributed?)
[ ] Choose placement (API gateway / middleware)
[ ] Explain algorithms (token bucket vs sliding window)
[ ] Justify choice for the scenario
[ ] Design distributed solution with Redis + Lua
[ ] Draw system architecture
[ ] HTTP response design (429, headers)
[ ] Handle race conditions (atomic operations)
[ ] Monitoring and alerting
[ ] Edge cases and failure modes
Interview Tips
- Know at least two algorithms cold and their trade-offs (token bucket + sliding window counter).
- Redis + Lua is the standard distributed answer. Explain why atomicity matters.
- Fail-open vs fail-closed is a critical decision. Most systems fail-open (allow traffic if rate limiter is down).
- HTTP 429 + headers shows API design maturity. Don't forget Retry-After.
- Layered limits (global + per-user + per-endpoint) shows depth.
- Monitoring -- if you can't see what the rate limiter is doing, you can't tune it.
Resources
- System Design Interview (Alex Xu): Chapter 4 - Design a Rate Limiter
- Stripe Engineering Blog: "Scaling API Rate Limiting at Stripe"
- Cloudflare Blog: "How We Built Rate Limiting"
- Redis documentation: Rate limiting patterns
- Kong: Rate Limiting plugin architecture
- Google Cloud: "Rate Limiting Strategies and Techniques"