28 - Design Rate Limiter

Previous: 27 - Design URL Shortener | Next: 29 - Design Key-Value Store


Why This Matters in Interviews

Rate limiting protects systems from abuse, prevents cascading failures, and enforces fair usage. It appears in nearly every system design interview -- either as the primary question or as a component you should mention proactively.


Step 1: Requirements

Functional Requirements

  1. Limit number of requests a client can send within a time window
  2. Support multiple limiting rules (per user, per IP, per API key, per endpoint)
  3. Return informative response when rate limited
  4. Work in a distributed environment (multiple servers)

Non-Functional Requirements

| Requirement | Target |
|---|---|
| Latency overhead | < 1ms per request |
| Availability | Must not become a single point of failure |
| Accuracy | Acceptable to be slightly over-limit (not under) |
| Scale | Handle millions of QPS across the fleet |
| Fault tolerance | If rate limiter fails, traffic should pass (fail-open) |

Step 2: Where to Place It

Option 1: Client-side
  [Client] ---(rate limit locally)---> [Server]
  Problem: Clients can bypass it. Not trustworthy.

Option 2: Server-side (application layer)
  [Client] ---> [Server + Rate Limiter] ---> [Business Logic]
  Problem: Each server counts independently.

Option 3: Middleware / API Gateway (Recommended)
  [Client] ---> [API Gateway / Rate Limiter] ---> [Server]
  Benefits: Centralized, consistent, offloads work from app servers.

Option 4: Service mesh sidecar
  [Client] ---> [Envoy Proxy + Rate Limit] ---> [Server]
  Benefits: Transparent to application code.

Decision Matrix

| Placement | Pros | Cons |
|---|---|---|
| Client | Reduces unnecessary calls | Untrusted, easily bypassed |
| Server | Access to user context | Inconsistent across instances |
| API Gateway | Centralized, consistent | Potential bottleneck |
| Service Mesh | Transparent, per-service | Infrastructure complexity |

Interview recommendation: API Gateway for external APIs, sidecar/middleware for internal service-to-service limiting.


Step 3: Rate Limiting Algorithms

Token Bucket

Bucket capacity: 4 tokens
Refill rate: 2 tokens/second

Time 0s:  [****]  (4 tokens, full)
Request:  [***-]  (consume 1 token, 3 left)
Request:  [**--]  (consume 1 token, 2 left)
Request:  [*---]  (consume 1 token, 1 left)
Request:  [----]  (consume 1 token, 0 left)
Request:  REJECTED (no tokens)

Time 1s:  [**--]  (2 tokens refilled)
Request:  [*---]  (consume 1, 1 left)

| Property | Value |
|---|---|
| Parameters | Bucket size (max burst), refill rate (sustained rate) |
| Burst allowed? | Yes (up to bucket size) |
| Memory | O(1) per user (counter + timestamp) |
| Used by | AWS API Gateway, Stripe |
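The refill-and-consume logic above can be sketched as a single-process token bucket (illustrative only -- the distributed version is the Redis + Lua script in Step 4; the class and parameter names here are assumptions):

```python
class TokenBucket:
    """Single-process token bucket sketch: capacity = max burst,
    refill_rate = sustained tokens per second."""

    def __init__(self, capacity: float, refill_rate: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity        # start full
        self.last_refill = now

    def allow(self, now: float) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Mirrors the diagram: capacity 4, refill 2 tokens/second.
bucket = TokenBucket(capacity=4, refill_rate=2)
burst = [bucket.allow(0.0) for _ in range(5)]
print(burst)                  # [True, True, True, True, False]
refilled = bucket.allow(1.0)  # one second later: 2 tokens refilled
print(refilled)               # True
```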

Leaky Bucket

Bucket processes requests at fixed rate, regardless of burst:

Incoming:  ||||||||   (burst of 8 requests)
Queue:     [||||||||] (queued)
Output:    |..|..|..  (processed at fixed rate: 1 per interval)

If queue is full --> REJECT

| Property | Value |
|---|---|
| Parameters | Bucket size (queue length), outflow rate |
| Burst allowed? | No (smooths traffic to constant rate) |
| Memory | O(1) per user |
| Used by | Networking (packet shaping) |
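A minimal sketch of the queue-and-drain behavior, assuming a single process and a caller-supplied clock (names are illustrative):

```python
from collections import deque

class LeakyBucket:
    """Leaky-bucket sketch: requests queue up and drain at a fixed
    outflow rate; arrivals that find the queue full are rejected."""

    def __init__(self, queue_size: int, leak_rate: float, now: float = 0.0):
        self.queue = deque()
        self.queue_size = queue_size
        self.leak_rate = leak_rate    # requests drained per second
        self.last_leak = now

    def _leak(self, now: float) -> None:
        # Drain whole requests for the time elapsed since the last leak.
        drained = int((now - self.last_leak) * self.leak_rate)
        for _ in range(min(drained, len(self.queue))):
            self.queue.popleft()
        if drained:
            self.last_leak = now

    def offer(self, request, now: float) -> bool:
        self._leak(now)
        if len(self.queue) >= self.queue_size:
            return False              # bucket overflow -> REJECT
        self.queue.append(request)
        return True

bucket = LeakyBucket(queue_size=4, leak_rate=1.0)
accepted = sum(bucket.offer(i, now=0.0) for i in range(8))
print(accepted)                        # burst of 8: only 4 fit in the queue
drained_ok = bucket.offer("later", now=2.0)  # 2 requests drained by now
print(drained_ok)                      # True
```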

Fixed Window Counter

Window: 1 minute, limit: 100 requests

Minute 1 [00:00-01:00]:  count = 87  --> ALLOW
Minute 2 [01:00-02:00]:  count = 0   --> starts fresh

Problem - boundary burst:
  [00:30-01:00]: 100 requests (allowed)
  [01:00-01:30]: 100 requests (allowed, new window)
  Result: 200 requests in 1 minute (2x limit!)

| Property | Value |
|---|---|
| Parameters | Window size, max requests per window |
| Burst at boundary? | Yes (major drawback) |
| Memory | O(1) per user |
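The counter reset and the boundary-burst weakness can be demonstrated in a few lines (single-process sketch; names are assumptions):

```python
class FixedWindowCounter:
    """Fixed-window sketch; note the boundary burst reproduced below."""

    def __init__(self, window_size: float, limit: int):
        self.window_size = window_size
        self.limit = limit
        self.current_window = -1
        self.count = 0

    def allow(self, now: float) -> bool:
        window = int(now // self.window_size)
        if window != self.current_window:   # new window: counter starts fresh
            self.current_window = window
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False

rl = FixedWindowCounter(window_size=60, limit=100)
end_of_minute = sum(rl.allow(59.0) for _ in range(100))   # all allowed
start_of_next = sum(rl.allow(61.0) for _ in range(100))   # all allowed again
print(end_of_minute + start_of_next)   # 200 requests within ~2 seconds
```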

Sliding Window Log

Track timestamp of each request in a sorted set:

Requests at: 00:15, 00:30, 00:45, 01:10, 01:15, 01:20
Window = 1 min, checking at 01:20:

Keep only: [00:21, 01:20]  (last 60 seconds)
In window: 01:10, 01:15, 01:20 = 3 requests

| Property | Value |
|---|---|
| Parameters | Window size, max requests |
| Burst at boundary? | No (precise) |
| Memory | O(N) per user (stores all timestamps) |
| Drawback | High memory for high-rate users |
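The evict-then-count logic can be sketched with a deque of timestamps (single-process illustration; in Redis this is typically a sorted set):

```python
from collections import deque

class SlidingWindowLog:
    """Sliding-window-log sketch: store every request timestamp and
    count only those within the last window_size seconds.
    Exact, but O(N) memory per user."""

    def __init__(self, window_size: float, limit: int):
        self.window_size = window_size
        self.limit = limit
        self.log = deque()

    def allow(self, now: float) -> bool:
        # Evict timestamps that have aged out of the window.
        while self.log and self.log[0] <= now - self.window_size:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False

rl = SlidingWindowLog(window_size=60, limit=3)
first = [rl.allow(t) for t in (0, 10, 20)]
fourth = rl.allow(30)   # 3 requests already in window -> rejected
aged = rl.allow(61)     # the request at t=0 has aged out -> allowed
print(first, fourth, aged)
```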

Sliding Window Counter (Best Trade-off)

Combines fixed window + weighted previous window:

Previous window (00:00-01:00): 84 requests
Current window  (01:00-02:00): 36 requests (so far)
Current position in window: 25% through

Weighted count = 84 * (1 - 0.25) + 36 = 63 + 36 = 99
Limit: 100 --> ALLOW (just barely)

| Property | Value |
|---|---|
| Parameters | Window size, max requests |
| Burst at boundary? | Minimal (smoothed) |
| Memory | O(1) per user (two counters + timestamp) |
| Accuracy | Approximate but very close |
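The weighted-count arithmetic above maps directly to code. A single-process sketch (class name and the simplified window-rollover are assumptions):

```python
class SlidingWindowCounter:
    """Two counters plus a weighted sum -- the same arithmetic as the
    worked example above."""

    def __init__(self, window_size: float, limit: int):
        self.window_size = window_size
        self.limit = limit
        self.current_window = 0
        self.prev_count = 0
        self.curr_count = 0

    def allow(self, now: float) -> bool:
        window = int(now // self.window_size)
        if window != self.current_window:
            # Roll forward; if more than one window passed, prev is empty.
            self.prev_count = self.curr_count if window == self.current_window + 1 else 0
            self.curr_count = 0
            self.current_window = window
        position = (now % self.window_size) / self.window_size
        weighted = self.prev_count * (1 - position) + self.curr_count
        if weighted < self.limit:
            self.curr_count += 1
            return True
        return False

# Reproduce the worked example: previous window saw 84 requests,
# current has 36, and we are 25% into the current window.
rl = SlidingWindowCounter(window_size=60, limit=100)
rl.current_window, rl.prev_count, rl.curr_count = 1, 84, 36
first = rl.allow(75.0)    # weighted = 84 * 0.75 + 36 = 99  -> allowed
second = rl.allow(75.0)   # weighted = 63 + 37      = 100 -> rejected
print(first, second)
```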

Algorithm Comparison

| Algorithm | Memory | Accuracy | Burst Handling | Complexity |
|---|---|---|---|---|
| Token Bucket | O(1) | Good | Allows controlled burst | Low |
| Leaky Bucket | O(1) | Good | No burst (smooths) | Low |
| Fixed Window | O(1) | Poor (boundary) | Allows 2x burst | Low |
| Sliding Log | O(N) | Exact | No boundary issue | Medium |
| Sliding Window Counter | O(1) | Very good | Minimal boundary issue | Low |

Interview recommendation: Token Bucket for most cases (allows bursts, simple, low memory). Sliding Window Counter when boundary accuracy matters.


Step 4: Distributed Rate Limiting with Redis

Why Redis?

  • Sub-millisecond latency
  • Atomic operations (INCR, EXPIRE)
  • Built-in TTL for key expiry
  • Cluster mode for horizontal scaling

Token Bucket with Redis + Lua

```lua
-- Lua script (atomic in Redis)
local key = KEYS[1]
local capacity = tonumber(ARGV[1])      -- max tokens
local refill_rate = tonumber(ARGV[2])   -- tokens per second
local now = tonumber(ARGV[3])           -- current timestamp

local bucket = redis.call('hmget', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now

-- Refill tokens based on elapsed time
local elapsed = now - last_refill
local new_tokens = math.min(capacity, tokens + elapsed * refill_rate)

if new_tokens >= 1 then
  new_tokens = new_tokens - 1
  redis.call('hmset', key, 'tokens', new_tokens, 'last_refill', now)
  -- EXPIRE requires an integer TTL, so round up
  redis.call('expire', key, math.ceil(capacity / refill_rate * 2))
  return 1 -- ALLOWED
else
  redis.call('hmset', key, 'tokens', new_tokens, 'last_refill', now)
  return 0 -- REJECTED
end
```

Why Lua? The entire script executes atomically in Redis -- no race conditions between read and write, even with concurrent requests.

Sliding Window Counter with Redis

```lua
-- Lua script for sliding window counter
local key = KEYS[1]
local window_size = tonumber(ARGV[1])   -- e.g., 60 seconds
local limit = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

local current_window = math.floor(now / window_size)
local previous_window = current_window - 1
local position_in_window = (now % window_size) / window_size

local current_key = key .. ":" .. current_window
local previous_key = key .. ":" .. previous_window

local current_count = tonumber(redis.call('get', current_key) or "0")
local previous_count = tonumber(redis.call('get', previous_key) or "0")

local weighted = previous_count * (1 - position_in_window) + current_count

if weighted < limit then
  redis.call('incr', current_key)
  redis.call('expire', current_key, window_size * 2)
  return 1 -- ALLOWED
else
  return 0 -- REJECTED
end
```

Step 5: System Architecture

                        +------------------+
                        |  Load Balancer   |
                        +--------+---------+
                                 |
              +------------------+------------------+
              |                  |                  |
      +-------+------+  +-------+------+  +-------+------+
      | API Gateway  |  | API Gateway  |  | API Gateway  |
      | + Rate Limit |  | + Rate Limit |  | + Rate Limit |
      |   Middleware |  |   Middleware |  |   Middleware |
      +-------+------+  +-------+------+  +-------+------+
              |                  |                  |
              +--------+---------+---------+--------+
                       |                   |
        limit checks,  |                   |  allowed requests
        rule lookups   v                   v
   +----------------+  +---------------+  +--------------+     +--------------+
   | Redis Cluster  |  | Rules Engine  |  | App Server 1 | ... | App Server N |
   | (counters,     |  | (rate limit   |  +--------------+     +--------------+
   |  tokens)       |  |  configs)     |
   +----------------+  +---------------+

Step 6: Rate Limiter Rules Engine

Rules define who, what, and how much:

```yaml
rules:
  - name: "Global API limit"
    key: "global"
    limit: 1000000
    window: "1m"
  - name: "Per-user API limit"
    key: "user:{user_id}"
    limit: 1000
    window: "1h"
  - name: "Per-IP login limit"
    key: "ip:{ip}:login"
    limit: 5
    window: "15m"
  - name: "Per-user expensive endpoint"
    key: "user:{user_id}:search"
    limit: 30
    window: "1m"
```

Evaluation order: Most specific rule wins. Check per-endpoint, then per-user, then per-IP, then global.
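The evaluation order can be sketched as a lookup over an in-memory rules table mirroring the YAML above (the field names, specificity scores, and matching helpers here are all hypothetical):

```python
# Hypothetical in-memory rules table; "specificity" orders evaluation:
# per-endpoint > per-user > per-IP > global.
RULES = [
    {"name": "Per-user expensive endpoint", "key": "user:{user_id}:search",
     "limit": 30, "window_s": 60, "specificity": 3},
    {"name": "Per-user API limit", "key": "user:{user_id}",
     "limit": 1000, "window_s": 3600, "specificity": 2},
    {"name": "Per-IP login limit", "key": "ip:{ip}:login",
     "limit": 5, "window_s": 900, "specificity": 1},
    {"name": "Global API limit", "key": "global",
     "limit": 1_000_000, "window_s": 60, "specificity": 0},
]

def matching_rules(user_id=None, ip=None, endpoint=None):
    """Return names of applicable rules, most specific first."""
    out = []
    for rule in sorted(RULES, key=lambda r: -r["specificity"]):
        key = rule["key"]
        if "{user_id}" in key and user_id is None:
            continue                       # rule needs a user we don't have
        if "{ip}" in key and ip is None:
            continue
        if ":search" in key and endpoint != "/api/search":
            continue                       # endpoint-scoped rules
        if ":login" in key and endpoint != "/login":
            continue
        out.append(rule["name"])
    return out

matched = matching_rules(user_id="12345", endpoint="/api/search")
print(matched)
# ['Per-user expensive endpoint', 'Per-user API limit', 'Global API limit']
```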


Step 7: Handling Race Conditions

The Lost Update Problem

Without atomicity:
  Server A: READ count = 99  (limit = 100)
  Server B: READ count = 99
  Server A: WRITE count = 100 (allowed)
  Server B: WRITE count = 100 (also allowed!)
  Result: 101 requests processed (over limit)

Solutions

| Approach | How | Trade-off |
|---|---|---|
| Redis Lua scripts | Atomic read-modify-write | Requires Redis, adds network hop |
| Redis MULTI/EXEC | Transaction block | Less flexible than Lua |
| Local + sync | Local counter synced periodically | Slightly over-limit during sync gaps |
| Sorted sets | ZREMRANGEBYSCORE + ZCARD in one transaction | Higher memory for sliding window |
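What all of these share is making the read-modify-write a single atomic step. A single-process analogy, using a lock in place of Redis atomicity (a sketch only, not how a distributed limiter is built):

```python
import threading

count = 0        # stands in for the shared Redis counter
limit = 100
allowed = 0
lock = threading.Lock()

def try_request():
    global count, allowed
    # Without the lock, two threads could both read count == 99 and
    # both decide to proceed -- the lost-update problem above.
    with lock:
        if count < limit:
            count += 1
            allowed += 1   # tally of accepted requests, same lock

threads = [threading.Thread(target=try_request) for _ in range(150)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(count, allowed)   # never exceeds the limit of 100
```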

Step 8: Rate Limiting Dimensions

| Dimension | Key Pattern | Use Case |
|---|---|---|
| User ID | user:12345 | Authenticated API usage |
| IP Address | ip:1.2.3.4 | Anonymous/unauthenticated |
| API Key | key:abc123 | Third-party integrations |
| Endpoint | user:12345:/api/search | Expensive operations |
| Tenant | tenant:acme | Multi-tenant SaaS |
| Global | global | Overall system protection |

Layered limiting: Apply multiple limits simultaneously. A request must pass ALL applicable limits.
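Layered limiting can be sketched as a check over every applicable key before any counter is touched (the in-memory dicts and limit values here are hypothetical stand-ins for the real counters):

```python
# Hypothetical per-key limits and current counts.
limits = {"global": 1_000_000, "user:12345": 1000, "user:12345:/api/search": 30}
counts = {"global": 950, "user:12345": 400, "user:12345:/api/search": 30}

def allow(keys):
    """A request must pass ALL applicable limits.
    Check first, then increment, so one rejection consumes nothing."""
    if any(counts[k] >= limits[k] for k in keys):
        return False
    for k in keys:
        counts[k] += 1
    return True

r1 = allow(["global", "user:12345"])                            # under all limits
r2 = allow(["global", "user:12345", "user:12345:/api/search"])  # search at limit
print(r1, r2)   # True False
```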


Step 9: HTTP Response Design

When Allowed

HTTP/1.1 200 OK
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 742
X-RateLimit-Reset: 1710500400

When Rate Limited

HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1710500400
Retry-After: 30
Content-Type: application/json

{
  "error": "rate_limit_exceeded",
  "message": "Rate limit exceeded. Try again in 30 seconds.",
  "retry_after": 30
}

Key headers:

| Header | Purpose |
|---|---|
| X-RateLimit-Limit | Total allowed requests in window |
| X-RateLimit-Remaining | Requests remaining in current window |
| X-RateLimit-Reset | Unix timestamp when the window resets |
| Retry-After | Seconds until the client should retry |
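A minimal, framework-agnostic sketch of building these responses (the function name and return shape are assumptions):

```python
import json

def rate_limit_response(limit, remaining, reset_ts, retry_after=None):
    """Return (status, headers, body) for a rate-limit-aware endpoint.
    A real service would set these on its framework's response object."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(remaining),
        "X-RateLimit-Reset": str(reset_ts),
    }
    if remaining > 0:
        return 200, headers, None
    # Rate limited: 429 plus Retry-After and a JSON error body.
    headers["Retry-After"] = str(retry_after)
    headers["Content-Type"] = "application/json"
    body = json.dumps({
        "error": "rate_limit_exceeded",
        "message": f"Rate limit exceeded. Try again in {retry_after} seconds.",
        "retry_after": retry_after,
    })
    return 429, headers, body

ok_status, ok_headers, ok_body = rate_limit_response(1000, 742, 1710500400)
status, headers, body = rate_limit_response(1000, 0, 1710500400, retry_after=30)
print(ok_status, status, headers["Retry-After"])   # 200 429 30
```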

Step 10: Monitoring

Dashboard:
+-----------------------------------+-----------------------------------+
| Rate Limited Requests/sec         | Top Rate-Limited Users            |
| [line chart by endpoint]          | user:12345  - 4,523 rejections    |
|                                   | user:67890  - 2,100 rejections    |
+-----------------------------------+-----------------------------------+
| Rejection Rate by Rule            | Redis Latency (p50/p99)           |
| Login: 12.3%                      | p50: 0.2ms  p99: 0.8ms            |
| Search: 3.1%                      | [line chart]                      |
| Global: 0.01%                     |                                   |
+-----------------------------------+-----------------------------------+

Key metrics to track:

  • Total requests allowed vs rejected (by rule, by dimension)
  • Redis latency for rate limit checks
  • False positive rate (legitimate users being limited)
  • Rule hit distribution (which rules trigger most?)

Step 11: Performance Considerations

| Concern | Solution |
|---|---|
| Redis latency | Use Redis cluster, co-locate with app servers |
| Redis failure | Fail-open (allow all traffic if Redis is down) |
| Hot keys | Shard by user/IP hash across Redis nodes |
| Lua script CPU | Keep scripts simple, avoid loops |
| Network overhead | Batch multiple limit checks in one roundtrip |
| Local caching | Cache "definitely under limit" locally for short TTL |

Hybrid Approach (Local + Centralized)

Request --> [Local counter (fast, approximate)]
              |
        Under local threshold? --> Allow immediately
              |
        Near threshold? --> [Check Redis (precise)]
              |
        Over limit? --> Reject

This reduces Redis calls by 80%+ for well-behaved clients.
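A toy simulation of the hybrid flow (the 80% local threshold and the in-process stand-in for the Redis counter are assumptions for illustration):

```python
LIMIT = 100
LOCAL_THRESHOLD = 80      # assumption: consult Redis past 80% of the limit

redis_calls = 0
shared_count = 0          # stands in for the centralized Redis counter

def check_redis() -> bool:
    """Simulated precise check against the shared counter."""
    global redis_calls, shared_count
    redis_calls += 1
    if shared_count < LIMIT:
        shared_count += 1
        return True
    return False

local_count = 0
def allow() -> bool:
    global local_count, shared_count
    local_count += 1
    if local_count <= LOCAL_THRESHOLD:
        shared_count += 1   # would be synced asynchronously in practice
        return True         # fast path: no Redis round-trip
    return check_redis()    # near the limit: pay for the precise check

results = [allow() for _ in range(120)]
print(sum(results), redis_calls)   # 100 allowed, only 40 Redis calls
```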


Edge Cases

| Scenario | Handling |
|---|---|
| Clock skew across servers | Use Redis server time (consistent) |
| Rate limiter becomes bottleneck | Fail-open, alert, auto-scale Redis |
| Distributed client (many IPs) | Rate limit by user ID, not just IP |
| Legitimate burst (product launch) | Temporary rule override, higher limits |
| VPN/proxy users sharing IP | Combine IP + user-agent + fingerprint |

Interview Walkthrough Checklist

[ ] Clarify requirements (which dimension, how strict, distributed?)
[ ] Choose placement (API gateway / middleware)
[ ] Explain algorithms (token bucket vs sliding window)
[ ] Justify choice for the scenario
[ ] Design distributed solution with Redis + Lua
[ ] Draw system architecture
[ ] HTTP response design (429, headers)
[ ] Handle race conditions (atomic operations)
[ ] Monitoring and alerting
[ ] Edge cases and failure modes

Interview Tips

  1. Know at least two algorithms cold and their trade-offs (token bucket + sliding window counter).
  2. Redis + Lua is the standard distributed answer. Explain why atomicity matters.
  3. Fail-open vs fail-closed is a critical decision. Most systems fail-open (allow traffic if rate limiter is down).
  4. HTTP 429 + headers shows API design maturity. Don't forget Retry-After.
  5. Layered limits (global + per-user + per-endpoint) shows depth.
  6. Monitoring -- if you can't see what the rate limiter is doing, you can't tune it.

Resources

  • System Design Interview (Alex Xu): Chapter 4 - Design a Rate Limiter
  • Stripe Engineering Blog: "Scaling API Rate Limiting at Stripe"
  • Cloudflare Blog: "How We Built Rate Limiting"
  • Redis documentation: Rate limiting patterns
  • Kong: Rate Limiting plugin architecture
  • Google Cloud: "Rate Limiting Strategies and Techniques"
