28 - Design Rate Limiter

Previous: 27 - Design URL Shortener | Next: 29 - Design Key-Value Store


Why This Matters in Interviews

Rate limiting protects systems from abuse, prevents cascading failures, and enforces fair usage. It appears in nearly every system design interview -- either as the primary question or as a component you should mention proactively.


Step 1: Requirements

Functional Requirements

  1. Limit number of requests a client can send within a time window
  2. Support multiple limiting rules (per user, per IP, per API key, per endpoint)
  3. Return informative response when rate limited
  4. Work in a distributed environment (multiple servers)

Non-Functional Requirements

| Requirement | Target |
|---|---|
| Latency overhead | < 1ms per request |
| Availability | Must not become a single point of failure |
| Accuracy | Acceptable to be slightly over-limit (not under) |
| Scale | Handle millions of QPS across the fleet |
| Fault tolerance | If rate limiter fails, traffic should pass (fail-open) |

Step 2: Where to Place It

Option 1: Client-side
  [Client] ---(rate limit locally)---> [Server]
  Problem: Clients can bypass it. Not trustworthy.

Option 2: Server-side (application layer)
  [Client] ---> [Server + Rate Limiter] ---> [Business Logic]
  Problem: Each server counts independently.

Option 3: Middleware / API Gateway (Recommended)
  [Client] ---> [API Gateway / Rate Limiter] ---> [Server]
  Benefits: Centralized, consistent, offloads work from app servers.

Option 4: Service mesh sidecar
  [Client] ---> [Envoy Proxy + Rate Limit] ---> [Server]
  Benefits: Transparent to application code.

Decision Matrix

| Placement | Pros | Cons |
|---|---|---|
| Client | Reduces unnecessary calls | Untrusted, easily bypassed |
| Server | Access to user context | Inconsistent across instances |
| API Gateway | Centralized, consistent | Potential bottleneck |
| Service Mesh | Transparent, per-service | Infrastructure complexity |

Interview recommendation: API Gateway for external APIs, sidecar/middleware for internal service-to-service limiting.


Step 3: Rate Limiting Algorithms

Token Bucket

Bucket capacity: 4 tokens
Refill rate: 2 tokens/second

Time 0s:  [****]  (4 tokens, full)
Request:  [***-]  (consume 1 token, 3 left)
Request:  [**--]  (consume 1 token, 2 left)
Request:  [*---]  (consume 1 token, 1 left)
Request:  [----]  (consume 1 token, 0 left)
Request:  REJECTED (no tokens)

Time 1s:  [**--]  (2 tokens refilled)
Request:  [*---]  (consume 1, 1 left)

| Property | Value |
|---|---|
| Parameters | Bucket size (max burst), refill rate (sustained rate) |
| Burst allowed? | Yes (up to bucket size) |
| Memory | O(1) per user (counter + timestamp) |
| Used by | AWS API Gateway, Stripe |
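The refill-and-consume logic above can be sketched as a single-process token bucket (illustrative only -- the distributed version is the Redis + Lua script in Step 4; the class and parameter names here are assumptions):

```python
class TokenBucket:
    """Single-process token bucket sketch: capacity = max burst,
    refill_rate = sustained tokens per second."""

    def __init__(self, capacity: float, refill_rate: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity        # start full
        self.last_refill = now

    def allow(self, now: float) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Mirrors the diagram: capacity 4, refill 2 tokens/second.
bucket = TokenBucket(capacity=4, refill_rate=2)
burst = [bucket.allow(0.0) for _ in range(5)]
print(burst)                  # [True, True, True, True, False]
refilled = bucket.allow(1.0)  # one second later: 2 tokens refilled
print(refilled)               # True
```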

Leaky Bucket

Bucket processes requests at fixed rate, regardless of burst:

Incoming:  ||||||||   (burst of 8 requests)
Queue:     [||||||||] (queued)
Output:    |..|..|..  (processed at fixed rate: 1 per interval)

If queue is full --> REJECT

| Property | Value |
|---|---|
| Parameters | Bucket size (queue length), outflow rate |
| Burst allowed? | No (smooths traffic to constant rate) |
| Memory | O(1) per user |
| Used by | Networking (packet shaping) |
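A minimal sketch of the queue-and-drain behavior, assuming a single process and a caller-supplied clock (names are illustrative):

```python
from collections import deque

class LeakyBucket:
    """Leaky-bucket sketch: requests queue up and drain at a fixed
    outflow rate; arrivals that find the queue full are rejected."""

    def __init__(self, queue_size: int, leak_rate: float, now: float = 0.0):
        self.queue = deque()
        self.queue_size = queue_size
        self.leak_rate = leak_rate    # requests drained per second
        self.last_leak = now

    def _leak(self, now: float) -> None:
        # Drain whole requests for the time elapsed since the last leak.
        drained = int((now - self.last_leak) * self.leak_rate)
        for _ in range(min(drained, len(self.queue))):
            self.queue.popleft()
        if drained:
            self.last_leak = now

    def offer(self, request, now: float) -> bool:
        self._leak(now)
        if len(self.queue) >= self.queue_size:
            return False              # bucket overflow -> REJECT
        self.queue.append(request)
        return True

bucket = LeakyBucket(queue_size=4, leak_rate=1.0)
accepted = sum(bucket.offer(i, now=0.0) for i in range(8))
print(accepted)                        # burst of 8: only 4 fit in the queue
drained_ok = bucket.offer("later", now=2.0)  # 2 requests drained by now
print(drained_ok)                      # True
```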

Fixed Window Counter

Window: 1 minute, limit: 100 requests

Minute 1 [00:00-01:00]:  count = 87  --> ALLOW
Minute 2 [01:00-02:00]:  count = 0   --> starts fresh

Problem - boundary burst:
  [00:30-01:00]: 100 requests (allowed)
  [01:00-01:30]: 100 requests (allowed, new window)
  Result: 200 requests in 1 minute (2x limit!)

| Property | Value |
|---|---|
| Parameters | Window size, max requests per window |
| Burst at boundary? | Yes (major drawback) |
| Memory | O(1) per user |
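The counter reset and the boundary-burst weakness can be demonstrated in a few lines (single-process sketch; names are assumptions):

```python
class FixedWindowCounter:
    """Fixed-window sketch; note the boundary burst reproduced below."""

    def __init__(self, window_size: float, limit: int):
        self.window_size = window_size
        self.limit = limit
        self.current_window = -1
        self.count = 0

    def allow(self, now: float) -> bool:
        window = int(now // self.window_size)
        if window != self.current_window:   # new window: counter starts fresh
            self.current_window = window
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False

rl = FixedWindowCounter(window_size=60, limit=100)
end_of_minute = sum(rl.allow(59.0) for _ in range(100))   # all allowed
start_of_next = sum(rl.allow(61.0) for _ in range(100))   # all allowed again
print(end_of_minute + start_of_next)   # 200 requests within ~2 seconds
```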

Sliding Window Log

Track timestamp of each request in a sorted set:

Requests at: 00:15, 00:30, 00:45, 01:10, 01:15, 01:20
Window = 1 min, checking at 01:20:

Keep only: [00:21, 01:20]  (last 60 seconds)
In window: 01:10, 01:15, 01:20 = 3 requests

| Property | Value |
|---|---|
| Parameters | Window size, max requests |
| Burst at boundary? | No (precise) |
| Memory | O(N) per user (stores all timestamps) |
| Drawback | High memory for high-rate users |
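The evict-then-count logic can be sketched with a deque of timestamps (single-process illustration; in Redis this is typically a sorted set):

```python
from collections import deque

class SlidingWindowLog:
    """Sliding-window-log sketch: store every request timestamp and
    count only those within the last window_size seconds.
    Exact, but O(N) memory per user."""

    def __init__(self, window_size: float, limit: int):
        self.window_size = window_size
        self.limit = limit
        self.log = deque()

    def allow(self, now: float) -> bool:
        # Evict timestamps that have aged out of the window.
        while self.log and self.log[0] <= now - self.window_size:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False

rl = SlidingWindowLog(window_size=60, limit=3)
first = [rl.allow(t) for t in (0, 10, 20)]
fourth = rl.allow(30)   # 3 requests already in window -> rejected
aged = rl.allow(61)     # the request at t=0 has aged out -> allowed
print(first, fourth, aged)
```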

Sliding Window Counter (Best Trade-off)

Combines fixed window + weighted previous window:

Previous window (00:00-01:00): 84 requests
Current window  (01:00-02:00): 36 requests (so far)
Current position in window: 25% through

Weighted count = 84 * (1 - 0.25) + 36 = 63 + 36 = 99
Limit: 100 --> ALLOW (just barely)

| Property | Value |
|---|---|
| Parameters | Window size, max requests |
| Burst at boundary? | Minimal (smoothed) |
| Memory | O(1) per user (two counters + timestamp) |
| Accuracy | Approximate but very close |
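The weighted-count arithmetic above maps directly to code. A single-process sketch (class name and the simplified window-rollover are assumptions):

```python
class SlidingWindowCounter:
    """Two counters plus a weighted sum -- the same arithmetic as the
    worked example above."""

    def __init__(self, window_size: float, limit: int):
        self.window_size = window_size
        self.limit = limit
        self.current_window = 0
        self.prev_count = 0
        self.curr_count = 0

    def allow(self, now: float) -> bool:
        window = int(now // self.window_size)
        if window != self.current_window:
            # Roll forward; if more than one window passed, prev is empty.
            self.prev_count = self.curr_count if window == self.current_window + 1 else 0
            self.curr_count = 0
            self.current_window = window
        position = (now % self.window_size) / self.window_size
        weighted = self.prev_count * (1 - position) + self.curr_count
        if weighted < self.limit:
            self.curr_count += 1
            return True
        return False

# Reproduce the worked example: previous window saw 84 requests,
# current has 36, and we are 25% into the current window.
rl = SlidingWindowCounter(window_size=60, limit=100)
rl.current_window, rl.prev_count, rl.curr_count = 1, 84, 36
first = rl.allow(75.0)    # weighted = 84 * 0.75 + 36 = 99  -> allowed
second = rl.allow(75.0)   # weighted = 63 + 37      = 100 -> rejected
print(first, second)
```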

Algorithm Comparison

| Algorithm | Memory | Accuracy | Burst Handling | Complexity |
|---|---|---|---|---|
| Token Bucket | O(1) | Good | Allows controlled burst | Low |
| Leaky Bucket | O(1) | Good | No burst (smooths) | Low |
| Fixed Window | O(1) | Poor (boundary) | Allows 2x burst | Low |
| Sliding Log | O(N) | Exact | No boundary issue | Medium |
| Sliding Window Counter | O(1) | Very good | Minimal boundary issue | Low |

Interview recommendation: Token Bucket for most cases (allows bursts, simple, low memory). Sliding Window Counter when boundary accuracy matters.


Step 4: Distributed Rate Limiting with Redis

Why Redis?

  • Sub-millisecond latency
  • Atomic operations (INCR, EXPIRE)
  • Built-in TTL for key expiry
  • Cluster mode for horizontal scaling

Token Bucket with Redis + Lua

```lua
-- Lua script (atomic in Redis)
local key = KEYS[1]
local capacity = tonumber(ARGV[1])      -- max tokens
local refill_rate = tonumber(ARGV[2])   -- tokens per second
local now = tonumber(ARGV[3])           -- current timestamp

local bucket = redis.call('hmget', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now

-- Refill tokens based on elapsed time
local elapsed = now - last_refill
local new_tokens = math.min(capacity, tokens + elapsed * refill_rate)

if new_tokens >= 1 then
  new_tokens = new_tokens - 1
  redis.call('hmset', key, 'tokens', new_tokens, 'last_refill', now)
  -- EXPIRE requires an integer TTL, so round up
  redis.call('expire', key, math.ceil(capacity / refill_rate * 2))
  return 1 -- ALLOWED
else
  redis.call('hmset', key, 'tokens', new_tokens, 'last_refill', now)
  return 0 -- REJECTED
end
```

Why Lua? The entire script executes atomically in Redis -- no race conditions between read and write, even with concurrent requests.

Sliding Window Counter with Redis

```lua
-- Lua script for sliding window counter
local key = KEYS[1]
local window_size = tonumber(ARGV[1])   -- e.g., 60 seconds
local limit = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

local current_window = math.floor(now / window_size)
local previous_window = current_window - 1
local position_in_window = (now % window_size) / window_size

local current_key = key .. ":" .. current_window
local previous_key = key .. ":" .. previous_window

local current_count = tonumber(redis.call('get', current_key) or "0")
local previous_count = tonumber(redis.call('get', previous_key) or "0")

local weighted = previous_count * (1 - position_in_window) + current_count

if weighted < limit then
  redis.call('incr', current_key)
  redis.call('expire', current_key, window_size * 2)
  return 1 -- ALLOWED
else
  return 0 -- REJECTED
end
```

Step 5: System Architecture

                        +------------------+
                        |  Load Balancer   |
                        +--------+---------+
                                 |
              +------------------+------------------+
              |                  |                  |
      +-------+------+  +-------+------+  +-------+------+
      | API Gateway  |  | API Gateway  |  | API Gateway  |
      | + Rate Limit |  | + Rate Limit |  | + Rate Limit |
      |   Middleware |  |   Middleware |  |   Middleware |
      +-------+------+  +-------+------+  +-------+------+
              |                  |                  |
              +--------+---------+---------+--------+
                       |                   |
        limit checks,  |                   |  allowed requests
        rule lookups   v                   v
   +----------------+  +---------------+  +--------------+     +--------------+
   | Redis Cluster  |  | Rules Engine  |  | App Server 1 | ... | App Server N |
   | (counters,     |  | (rate limit   |  +--------------+     +--------------+
   |  tokens)       |  |  configs)     |
   +----------------+  +---------------+

Step 6: Rate Limiter Rules Engine

Rules define who, what, and how much:

```yaml
rules:
  - name: "Global API limit"
    key: "global"
    limit: 1000000
    window: "1m"
  - name: "Per-user API limit"
    key: "user:{user_id}"
    limit: 1000
    window: "1h"
  - name: "Per-IP login limit"
    key: "ip:{ip}:login"
    limit: 5
    window: "15m"
  - name: "Per-user expensive endpoint"
    key: "user:{user_id}:search"
    limit: 30
    window: "1m"
```

Evaluation order: Most specific rule wins. Check per-endpoint, then per-user, then per-IP, then global.
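The evaluation order can be sketched as a lookup over an in-memory rules table mirroring the YAML above (the field names, specificity scores, and matching helpers here are all hypothetical):

```python
# Hypothetical in-memory rules table; "specificity" orders evaluation:
# per-endpoint > per-user > per-IP > global.
RULES = [
    {"name": "Per-user expensive endpoint", "key": "user:{user_id}:search",
     "limit": 30, "window_s": 60, "specificity": 3},
    {"name": "Per-user API limit", "key": "user:{user_id}",
     "limit": 1000, "window_s": 3600, "specificity": 2},
    {"name": "Per-IP login limit", "key": "ip:{ip}:login",
     "limit": 5, "window_s": 900, "specificity": 1},
    {"name": "Global API limit", "key": "global",
     "limit": 1_000_000, "window_s": 60, "specificity": 0},
]

def matching_rules(user_id=None, ip=None, endpoint=None):
    """Return names of applicable rules, most specific first."""
    out = []
    for rule in sorted(RULES, key=lambda r: -r["specificity"]):
        key = rule["key"]
        if "{user_id}" in key and user_id is None:
            continue                       # rule needs a user we don't have
        if "{ip}" in key and ip is None:
            continue
        if ":search" in key and endpoint != "/api/search":
            continue                       # endpoint-scoped rules
        if ":login" in key and endpoint != "/login":
            continue
        out.append(rule["name"])
    return out

matched = matching_rules(user_id="12345", endpoint="/api/search")
print(matched)
# ['Per-user expensive endpoint', 'Per-user API limit', 'Global API limit']
```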


Step 7: Handling Race Conditions

The Lost Update Problem

Without atomicity:
  Server A: READ count = 99  (limit = 100)
  Server B: READ count = 99
  Server A: WRITE count = 100 (allowed)
  Server B: WRITE count = 100 (also allowed!)
  Result: 101 requests processed (over limit)

Solutions

| Approach | How | Trade-off |
|---|---|---|
| Redis Lua scripts | Atomic read-modify-write | Requires Redis, adds network hop |
| Redis MULTI/EXEC | Transaction block | Less flexible than Lua |
| Local + sync | Local counter synced periodically | Slightly over-limit during sync gaps |
| Sorted sets | ZREMRANGEBYSCORE + ZCARD in one transaction | Higher memory for sliding window |
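What all of these share is making the read-modify-write a single atomic step. A single-process analogy, using a lock in place of Redis atomicity (a sketch only, not how a distributed limiter is built):

```python
import threading

count = 0        # stands in for the shared Redis counter
limit = 100
allowed = 0
lock = threading.Lock()

def try_request():
    global count, allowed
    # Without the lock, two threads could both read count == 99 and
    # both decide to proceed -- the lost-update problem above.
    with lock:
        if count < limit:
            count += 1
            allowed += 1   # tally of accepted requests, same lock

threads = [threading.Thread(target=try_request) for _ in range(150)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(count, allowed)   # never exceeds the limit of 100
```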

Step 8: Rate Limiting Dimensions

| Dimension | Key Pattern | Use Case |
|---|---|---|
| User ID | user:12345 | Authenticated API usage |
| IP Address | ip:1.2.3.4 | Anonymous/unauthenticated |
| API Key | key:abc123 | Third-party integrations |
| Endpoint | user:12345:/api/search | Expensive operations |
| Tenant | tenant:acme | Multi-tenant SaaS |
| Global | global | Overall system protection |

Layered limiting: Apply multiple limits simultaneously. A request must pass ALL applicable limits.
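Layered limiting can be sketched as a check over every applicable key before any counter is touched (the in-memory dicts and limit values here are hypothetical stand-ins for the real counters):

```python
# Hypothetical per-key limits and current counts.
limits = {"global": 1_000_000, "user:12345": 1000, "user:12345:/api/search": 30}
counts = {"global": 950, "user:12345": 400, "user:12345:/api/search": 30}

def allow(keys):
    """A request must pass ALL applicable limits.
    Check first, then increment, so one rejection consumes nothing."""
    if any(counts[k] >= limits[k] for k in keys):
        return False
    for k in keys:
        counts[k] += 1
    return True

r1 = allow(["global", "user:12345"])                            # under all limits
r2 = allow(["global", "user:12345", "user:12345:/api/search"])  # search at limit
print(r1, r2)   # True False
```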


Step 9: HTTP Response Design

When Allowed

HTTP/1.1 200 OK
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 742
X-RateLimit-Reset: 1710500400

When Rate Limited

HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1710500400
Retry-After: 30
Content-Type: application/json

{
  "error": "rate_limit_exceeded",
  "message": "Rate limit exceeded. Try again in 30 seconds.",
  "retry_after": 30
}

Key headers:

| Header | Purpose |
|---|---|
| X-RateLimit-Limit | Total allowed requests in window |
| X-RateLimit-Remaining | Requests remaining in current window |
| X-RateLimit-Reset | Unix timestamp when the window resets |
| Retry-After | Seconds until the client should retry |
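A minimal, framework-agnostic sketch of building these responses (the function name and return shape are assumptions):

```python
import json

def rate_limit_response(limit, remaining, reset_ts, retry_after=None):
    """Return (status, headers, body) for a rate-limit-aware endpoint.
    A real service would set these on its framework's response object."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(remaining),
        "X-RateLimit-Reset": str(reset_ts),
    }
    if remaining > 0:
        return 200, headers, None
    # Rate limited: 429 plus Retry-After and a JSON error body.
    headers["Retry-After"] = str(retry_after)
    headers["Content-Type"] = "application/json"
    body = json.dumps({
        "error": "rate_limit_exceeded",
        "message": f"Rate limit exceeded. Try again in {retry_after} seconds.",
        "retry_after": retry_after,
    })
    return 429, headers, body

ok_status, ok_headers, ok_body = rate_limit_response(1000, 742, 1710500400)
status, headers, body = rate_limit_response(1000, 0, 1710500400, retry_after=30)
print(ok_status, status, headers["Retry-After"])   # 200 429 30
```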

Step 10: Monitoring

Dashboard:
+-----------------------------------+-----------------------------------+
| Rate Limited Requests/sec         | Top Rate-Limited Users            |
| [line chart by endpoint]          | user:12345  - 4,523 rejections    |
|                                   | user:67890  - 2,100 rejections    |
+-----------------------------------+-----------------------------------+
| Rejection Rate by Rule            | Redis Latency (p50/p99)           |
| Login: 12.3%                      | p50: 0.2ms  p99: 0.8ms            |
| Search: 3.1%                      | [line chart]                      |
| Global: 0.01%                     |                                   |
+-----------------------------------+-----------------------------------+

Key metrics to track:

  • Total requests allowed vs rejected (by rule, by dimension)
  • Redis latency for rate limit checks
  • False positive rate (legitimate users being limited)
  • Rule hit distribution (which rules trigger most?)

Step 11: Performance Considerations

| Concern | Solution |
|---|---|
| Redis latency | Use Redis cluster, co-locate with app servers |
| Redis failure | Fail-open (allow all traffic if Redis is down) |
| Hot keys | Shard by user/IP hash across Redis nodes |
| Lua script CPU | Keep scripts simple, avoid loops |
| Network overhead | Batch multiple limit checks in one roundtrip |
| Local caching | Cache "definitely under limit" locally for short TTL |

Hybrid Approach (Local + Centralized)

Request --> [Local counter (fast, approximate)]
              |
        Under local threshold? --> Allow immediately
              |
        Near threshold? --> [Check Redis (precise)]
              |
        Over limit? --> Reject

This reduces Redis calls by 80%+ for well-behaved clients.
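A toy simulation of the hybrid flow (the 80% local threshold and the in-process stand-in for the Redis counter are assumptions for illustration):

```python
LIMIT = 100
LOCAL_THRESHOLD = 80      # assumption: consult Redis past 80% of the limit

redis_calls = 0
shared_count = 0          # stands in for the centralized Redis counter

def check_redis() -> bool:
    """Simulated precise check against the shared counter."""
    global redis_calls, shared_count
    redis_calls += 1
    if shared_count < LIMIT:
        shared_count += 1
        return True
    return False

local_count = 0
def allow() -> bool:
    global local_count, shared_count
    local_count += 1
    if local_count <= LOCAL_THRESHOLD:
        shared_count += 1   # would be synced asynchronously in practice
        return True         # fast path: no Redis round-trip
    return check_redis()    # near the limit: pay for the precise check

results = [allow() for _ in range(120)]
print(sum(results), redis_calls)   # 100 allowed, only 40 Redis calls
```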


Edge Cases

| Scenario | Handling |
|---|---|
| Clock skew across servers | Use Redis server time (consistent) |
| Rate limiter becomes bottleneck | Fail-open, alert, auto-scale Redis |
| Distributed client (many IPs) | Rate limit by user ID, not just IP |
| Legitimate burst (product launch) | Temporary rule override, higher limits |
| VPN/proxy users sharing IP | Combine IP + user-agent + fingerprint |

Interview Walkthrough Checklist

[ ] Clarify requirements (which dimension, how strict, distributed?)
[ ] Choose placement (API gateway / middleware)
[ ] Explain algorithms (token bucket vs sliding window)
[ ] Justify choice for the scenario
[ ] Design distributed solution with Redis + Lua
[ ] Draw system architecture
[ ] HTTP response design (429, headers)
[ ] Handle race conditions (atomic operations)
[ ] Monitoring and alerting
[ ] Edge cases and failure modes

Interview Tips

  1. Know at least two algorithms cold and their trade-offs (token bucket + sliding window counter).
  2. Redis + Lua is the standard distributed answer. Explain why atomicity matters.
  3. Fail-open vs fail-closed is a critical decision. Most systems fail-open (allow traffic if rate limiter is down).
  4. HTTP 429 + headers shows API design maturity. Don't forget Retry-After.
  5. Layered limits (global + per-user + per-endpoint) shows depth.
  6. Monitoring -- if you can't see what the rate limiter is doing, you can't tune it.

Resources

  • System Design Interview (Alex Xu): Chapter 4 - Design a Rate Limiter
  • Stripe Engineering Blog: "Scaling API Rate Limiting at Stripe"
  • Cloudflare Blog: "How We Built Rate Limiting"
  • Redis documentation: Rate limiting patterns
  • Kong: Rate Limiting plugin architecture
  • Google Cloud: "Rate Limiting Strategies and Techniques"
