30 - Design Notification System
Previous: 29 - Design Key-Value Store | Next: 31 - Design Chat System
Why This Matters in Interviews
Notification systems touch every FAANG product: order confirmations, friend requests, OTP codes, marketing campaigns. This question tests your ability to design a multi-channel delivery pipeline with reliability guarantees, user preferences, priority handling, and de-duplication at scale.
1. Requirements
Functional Requirements
| Feature | Details |
|---|---|
| Push notifications | iOS (APNs), Android (FCM) |
| SMS | OTP, account alerts |
| Email | Transactional and marketing |
| In-app | Real-time feed within the product |
| User preferences | Opt-in/out per channel, quiet hours, frequency caps |
| Templating | Reusable templates with variable substitution |
| Scheduling | Send-at-time for scheduled notifications |
| Analytics | Sent, delivered, opened, clicked tracking |
Non-Functional Requirements
| Requirement | Target |
|---|---|
| Latency | < 5s for critical (OTP), < 30s for standard |
| Throughput | 10M+ notifications/day |
| Availability | 99.99% for critical notifications |
| Reliability | At-least-once delivery, exactly-once appearance to user |
| Ordering | Best-effort ordering per user (not strict) |
2. Notification Types and Providers
| Channel | Provider | Protocol | Delivery Guarantee |
|---|---|---|---|
| iOS Push | Apple APNs | HTTP/2 | Best-effort (device must be reachable) |
| Android Push | Google FCM | HTTP/2 | Best-effort (device must be reachable) |
| SMS | Twilio, Vonage, AWS SNS | REST API | Carrier-dependent delivery |
| Email | SendGrid, AWS SES, Mailgun | SMTP / REST | Delivery + bounce tracking |
| In-app | Internal WebSocket / polling | WebSocket/SSE | Guaranteed (stored server-side) |
3. High-Level Architecture
+------------------+
| Event Sources | (Order Service, Auth Service, Marketing, Cron)
+--------+---------+
|
| (trigger notification event)
v
+--------+---------+
| Notification |
| Service (API) |
| - Validate event |
| - Enrich context |
+--------+---------+
|
v
+--------+---------+
| User Preference | (opt-in/out, quiet hours, frequency caps)
| Check |
+--------+---------+
|
(filtered)
v
+--------+---------+
| Rate Limiter | (per-user, per-channel limits)
+--------+---------+
|
v
+--------+---------+
| Template Engine | "Hi {{name}}, your order {{order_id}} has shipped"
+--------+---------+
|
v
+--------+---------+
| Priority Queue | (Kafka topics partitioned by priority)
| P0: critical | OTP, security alerts
| P1: high | Payment, order updates
| P2: medium | Social (likes, comments)
| P3: low | Marketing, recommendations
+--------+---------+
|
+----------+-----------+-----------+
| | | |
+----+---+ +----+---+ +----+---+ +----+----+
| Push | | SMS | | Email | | In-App |
| Worker | | Worker | | Worker | | Worker |
+----+---+ +----+---+ +----+---+ +----+----+
| | | |
+----+---+ +----+---+ +----+---+ +----+----+
| APNs / | | Twilio | |SendGrid| | WebSocket|
| FCM | | | | / SES | | / Store |
+--------+ +--------+ +--------+ +---------+
4. Detailed Notification Flow
Step 1: Event Trigger
Order Service --> POST /api/v1/notifications
{
"event_type": "order_shipped",
"user_id": "u-12345",
"data": { "order_id": "ord-789", "tracking_url": "..." },
"channels": ["push", "email"], // requested channels
"priority": "high",
"idempotency_key": "order-shipped-ord-789"
}
Step 2: Notification Service
- Validate payload and event_type
- Check idempotency_key (have we processed this before?)
- Look up user profile: name, email, device tokens, phone
- Look up user preferences: channels enabled, quiet hours
Step 3: Preference Filter
User preferences for u-12345:
push: enabled
email: enabled, but quiet hours 22:00-08:00
sms: disabled
Current time: 23:15 --> email deferred to 08:00
Result: push now, email scheduled for 08:00
Step 4: Rate Limit Check
User u-12345 received 4 push notifications in the last hour.
Limit: 5/hour --> ALLOW (1 remaining)
Step 5: Template Rendering
Template: "order_shipped"
"Hi {{name}}, your order {{order_id}} has shipped! Track it here: {{tracking_url}}"
-->
"Hi Zineddine, your order ord-789 has shipped! Track it here: https://..."
Step 6: Enqueue
Push notification --> Kafka topic: notifications.push.high
Email (scheduled) --> DB row with send_at: 2024-03-16T08:00:00Z
Step 7: Worker Delivery
Push Worker picks from Kafka --> Calls FCM API --> ACK on success
Email Scheduler (cron) picks rows where send_at <= now --> Email Worker --> SES
5. Provider Integration
Push Notifications (APNs + FCM)
Push Worker:
1. Fetch device tokens for user from Device Registry
2. Build platform-specific payload:
APNs payload (iOS):
{
"aps": {
"alert": { "title": "Order Shipped", "body": "Your order..." },
"badge": 3,
"sound": "default"
},
"order_id": "ord-789"
}
FCM payload (Android):
{
"message": {
"token": "device_token_here",
"notification": { "title": "Order Shipped", "body": "Your order..." },
"data": { "order_id": "ord-789" }
}
}
3. Send via HTTP/2 to APNs / FCM
4. Handle responses:
- Success: mark delivered
- Invalid token: remove from Device Registry
- Rate limited: retry with backoff
- Server error: retry with backoff
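The response handling in step 4 boils down to classifying each provider response as success, permanent failure (drop the token), transient failure (retry), or unknown (dead-letter). A sketch with hypothetical status/reason values — real APNs and FCM error names differ (APNs returns reasons like `BadDeviceToken` or `Unregistered`, FCM v1 uses `UNREGISTERED`):

```python
def handle_push_response(status: int, reason: str, attempt: int, max_attempts: int = 3) -> str:
    """Map a provider response to one of the four actions listed above.
    Status codes and reason strings here are illustrative, not a real provider contract."""
    if status == 200:
        return "mark_delivered"
    if reason in ("BadDeviceToken", "Unregistered", "UNREGISTERED"):
        # Permanent: the token is dead (uninstall, expiry). Clean up, never retry.
        return "remove_token"
    if status == 429 or status >= 500:
        # Transient: provider throttling or outage. Retry with backoff until exhausted.
        return "retry_with_backoff" if attempt < max_attempts else "dead_letter"
    return "dead_letter"  # unknown or permanent failure: park for inspection
```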
Device Token Management
+-------------------+
| Device Registry |
+-------------------+
| user_id | VARCHAR |
| token | VARCHAR |
| platform| ENUM | (ios, android, web)
| app_ver | VARCHAR |
| updated | TIMESTAMP|
+-------------------+
On app launch: register/update token
On uninstall: APNs/FCM returns "invalid token" --> delete
On logout: remove token association
Email (SES / SendGrid)
Email Worker:
1. Render HTML template (MJML or server-side)
2. Attach tracking pixel for open tracking
3. Rewrite links for click tracking
4. Send via SES/SendGrid API
5. Process webhooks for:
- Delivered
- Opened (tracking pixel loaded)
- Clicked (link redirect)
- Bounced (hard/soft)
- Complained (spam report) --> auto-unsubscribe user
SMS (Twilio)
SMS Worker:
1. Format message (160 char limit for SMS, or multi-part)
2. Send via Twilio API
3. Process status callbacks:
- Queued -> Sent -> Delivered
- Failed (invalid number, carrier rejection)
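The 160-character limit in step 1 is worth quantifying, since SMS is billed per segment: a single GSM-7 message carries 160 characters, but concatenated messages lose 7 characters per part to the segmentation header, leaving 153 each. A simplified sketch (it ignores extended GSM characters that count double, and the UCS-2 fallback of 70/67 for non-GSM text):

```python
import math

def sms_segments(text: str) -> int:
    """Number of billed segments for a GSM-7 encoded message.

    One SMS holds 160 seven-bit characters; concatenated parts hold 153
    because 7 characters go to the concatenation header (UDH)."""
    if len(text) <= 160:
        return 1
    return math.ceil(len(text) / 153)
```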
6. User Preferences
Schema
notification_preferences:
+----------+----------+----------+---------+-----------+----------+
| user_id | channel | enabled | quiet | quiet | freq_cap |
| | | | start | end | (per hr) |
+----------+----------+----------+---------+-----------+----------+
| u-12345 | push | true | null | null | 5 |
| u-12345 | email | true | 22:00 | 08:00 | 3 |
| u-12345 | sms | false | null | null | 1 |
| u-12345 | in_app | true | null | null | null |
+----------+----------+----------+---------+-----------+----------+
Fine-grained preferences (optional):
+----------+-------------------+----------+
| user_id | notification_type | enabled |
+----------+-------------------+----------+
| u-12345 | marketing | false |
| u-12345 | order_updates | true |
| u-12345 | social_activity | true |
+----------+-------------------+----------+
Preference Check Logic
def should_send(user_id, channel, notification_type, current_time):
    prefs = get_preferences(user_id, channel)
    # Channel disabled entirely?
    if not prefs.enabled:
        return SKIP
    # This notification type opted out?
    if not get_type_preference(user_id, notification_type).enabled:
        return SKIP
    # Inside quiet hours? Defer rather than drop.
    if prefs.quiet_start and in_quiet_hours(current_time, prefs):
        return DEFER(next_available_time(current_time, prefs))
    # Frequency cap reached? (cap may be null, e.g. for in-app)
    if prefs.freq_cap and recent_count(user_id, channel, window_hours=1) >= prefs.freq_cap:
        return SKIP  # or DEFER, depending on product policy
    return SEND
7. Priority Queue Design
Kafka Topics (partitioned by user_id for ordering):
notifications.push.critical --> Consumer Group: push-workers (dedicated)
notifications.push.high --> Consumer Group: push-workers
notifications.push.medium --> Consumer Group: push-workers
notifications.push.low --> Consumer Group: push-workers (lowest priority)
notifications.email.critical --> Consumer Group: email-workers
notifications.email.high --> Consumer Group: email-workers
...
Priority enforcement:
Workers poll critical topics first, then high, then medium, then low.
Under load, low-priority notifications are naturally delayed.
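Strict priority polling can be sketched with in-memory stand-ins for the per-priority topics. A production worker would poll the Kafka topics in this same order, often reserving a small quota for low priorities so they are delayed but never starved:

```python
from collections import deque

PRIORITY_ORDER = ("critical", "high", "medium", "low")
queues = {p: deque() for p in PRIORITY_ORDER}  # stand-ins for the Kafka topics

def next_notification():
    """Always drain higher-priority queues first; low priority waits under load."""
    for p in PRIORITY_ORDER:
        if queues[p]:
            return queues[p].popleft()
    return None

queues["low"].append("weekly_digest")
queues["critical"].append("otp_code")
# next_notification() -> "otp_code" (critical drained before low)
```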
| Priority | Examples | Max Latency | Retry Attempts |
|---|---|---|---|
| P0 Critical | OTP, security alerts, password reset | < 5s | 5 |
| P1 High | Payment confirmation, order shipped | < 30s | 3 |
| P2 Medium | Friend request, comment, like | < 5min | 3 |
| P3 Low | Marketing, weekly digest, recommendations | < 1hr | 1 |
8. Retry and Dead Letter Queue
Worker attempts delivery:
Attempt 1: FAIL (provider timeout)
--> wait 1s (base * 2^0)
Attempt 2: FAIL (provider 500)
--> wait 2s (base * 2^1)
Attempt 3: FAIL (provider 500)
--> wait 4s (base * 2^2)
Attempt 4 (max reached):
--> Move to Dead Letter Queue (DLQ)
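The schedule above (1s, 2s, 4s) is plain exponential backoff. Adding jitter spreads retries out so a provider recovering from an outage is not hit by every worker at the same instant. A sketch:

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0, jitter: bool = True) -> float:
    """Delay before retry `attempt` (0-based): min(cap, base * 2^attempt).
    With full jitter the delay is drawn uniformly from [0, that value]."""
    delay = min(cap, base * (2 ** attempt))
    return random.uniform(0, delay) if jitter else delay

# Deterministic schedule matches the trace above: 1s, 2s, 4s
# [backoff_delay(n, jitter=False) for n in range(3)] -> [1.0, 2.0, 4.0]
```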
DLQ Processing:
- Manual inspection dashboard
- Automated retry after provider recovery
- Alert if DLQ depth exceeds threshold
- Discard after 24 hours (notification is stale)
+----------+ +----------+ +----------+
| Main | --> | Retry | --> | Dead |
| Queue | | Queue | | Letter |
| | | (backoff)| | Queue |
+----------+ +----------+ +----------+
9. Deduplication
Why Dedup Matters
Without dedup:
Order Service retries event --> Notification Service processes twice
--> User gets "Your order shipped!" twice
This is annoying and erodes trust.
Implementation
Idempotency key: "order-shipped-ord-789"
On receive:
1. Check Redis/DB: has this idempotency_key been processed?
- YES --> Return 200 OK (already handled), skip processing
- NO --> Process notification, store key with TTL (24h)
Redis command:
SET idempotency:order-shipped-ord-789 1 EX 86400 NX
(NX = set only if the key does not exist; Redis returns nil when the key is already set, i.e. this is a duplicate)
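The check-then-store logic can be sketched with an in-memory stand-in for the Redis pattern above (same semantics as `SET key 1 EX ttl NX`: the first caller within the TTL window wins, everyone else is a duplicate):

```python
import time

class IdempotencyStore:
    """In-memory sketch of the Redis SET-NX-EX idempotency check:
    first_time() returns True exactly once per key within the TTL window."""
    def __init__(self):
        self._seen: dict[str, float] = {}  # key -> expiry timestamp

    def first_time(self, key: str, ttl: int = 86400) -> bool:
        now = time.time()
        expiry = self._seen.get(key)
        if expiry is not None and expiry > now:
            return False  # duplicate within TTL: skip processing
        self._seen[key] = now + ttl
        return True

store = IdempotencyStore()
store.first_time("order-shipped-ord-789")  # True: first time, process it
store.first_time("order-shipped-ord-789")  # False: duplicate, return 200 and skip
```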
10. Scheduled Notifications
+-------------------+
| Scheduled Store |
| (Database) |
+-------------------+
| notification_id |
| user_id |
| channel |
| payload |
| send_at | <-- future timestamp
| status | (pending, sent, cancelled)
| created_at |
+-------------------+
Scheduler Worker (runs every minute):
SELECT * FROM scheduled_notifications
WHERE send_at <= NOW()
AND status = 'pending'
ORDER BY send_at
LIMIT 1000
For each:
1. Publish to notification queue
2. Update status = 'sent'
(Use SELECT ... FOR UPDATE SKIP LOCKED to prevent double-processing)
Use Cases
- "Your trial ends in 3 days" (scheduled at trial_start + 11 days)
- Weekly digest emails (every Monday 9 AM in user's timezone)
- Quiet hours deferral (email deferred from 23:15 to 08:00)
- Marketing campaign (send to 10M users starting at 10 AM EST)
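The weekly-digest case needs timezone-aware scheduling: "Monday 9 AM" is a wall-clock time in the user's timezone, not a UTC offset. A sketch using Python's zoneinfo (the function name and the Monday-9-AM rule are illustrative):

```python
from datetime import datetime, time, timedelta
from zoneinfo import ZoneInfo

def next_weekly_digest(now_utc: datetime, tz_name: str) -> datetime:
    """Next Monday 09:00 in the user's timezone, as an aware datetime.
    Scheduling in local wall-clock time keeps the digest at 9 AM across DST changes."""
    tz = ZoneInfo(tz_name)
    local = now_utc.astimezone(tz)
    days_ahead = (0 - local.weekday()) % 7  # Monday is weekday 0
    candidate = datetime.combine(local.date() + timedelta(days=days_ahead), time(9, 0), tz)
    if candidate <= local:  # already past 9 AM this Monday
        candidate += timedelta(days=7)
    return candidate
```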
11. Analytics Pipeline
Notification Events:
created --> queued --> sent --> delivered --> opened --> clicked
|
+--> failed --> retried --> sent --> ...
+--> failed --> DLQ
+------------------+ +----------+ +------------------+
| Notification | --> | Kafka | --> | Analytics |
| Workers | | Events | | Consumer |
| (emit events) | | Topic | | (aggregate) |
+------------------+ +----------+ +--------+---------+
|
+--------+---------+
| Analytics DB |
| (ClickHouse) |
+--------+---------+
|
+--------+---------+
| Dashboard |
| - Delivery rate |
| - Open rate |
| - Click rate |
| - Bounce rate |
+------------------+
Key Metrics
| Metric | Formula | Healthy Target |
|---|---|---|
| Delivery rate | delivered / sent | > 98% (push), > 95% (email) |
| Open rate | opened / delivered | 15-25% (email), 40-60% (push) |
| Click rate | clicked / opened | 2-5% (email) |
| Bounce rate | bounced / sent | < 2% (email) |
| Unsubscribe rate | unsubscribed / delivered | < 0.5% |
12. Rate Limiting for Notifications
Per-user limits:
Push: 5/hour, 20/day
Email: 3/hour, 10/day
SMS: 2/hour, 5/day (SMS is expensive)
Per-channel global limits:
APNs: stay under provider rate limits
Twilio: account-level TPS limit
Implementation:
Same sliding window counter as 28 - Design Rate Limiter
Key: rate_limit:notif:{user_id}:{channel}:{window}
13. Full System Diagram
+----------------+ +----------------+ +------------------+
| Order Service | | Auth Service | | Marketing Engine |
+-------+--------+ +-------+--------+ +--------+---------+
| | |
+--------------------+--------------------+
|
+--------+--------+
| Notification |
| Service (API) |
| - Auth + validate
| - Idempotency |
+--------+--------+
|
+--------------+--------------+
| | |
+--------+-----+ +-----+------+ +-----+------+
| User Prefs | | Rate | | Template |
| Service | | Limiter | | Engine |
| (filter) | | (throttle) | | (render) |
+--------------+ +------------+ +-----+------+
|
+--------+--------+
| Priority Queues |
| (Kafka) |
+--------+--------+
|
+------------+--------+-------+--------+
| | | |
+------+---+ +-----+--+ +---+----+ +---------+--+
| Push | | SMS | | Email | | In-App |
| Workers | | Workers| | Workers| | Workers |
+------+---+ +-----+--+ +---+----+ +-----+-----+
| | | |
+------+---+ +-----+--+ +---+----+ +------+-----+
| APNs/FCM | | Twilio | | SES/SG | | WebSocket |
+----------+ +--------+ +--------+ | + DB Store |
+------------+
|
+--------+--------+
| DLQ + Retry |
| Workers |
+-----------------+
|
+--------+--------+
| Analytics |
| (ClickHouse) |
+-----------------+
|
+--------+--------+
| Scheduled |
| Notification DB |
| + Cron Worker |
+-----------------+
14. Handling Scale: 10M+ Notifications/Day
| Challenge | Solution |
|---|---|
| Burst traffic (flash sales) | Kafka absorbs spikes, workers scale horizontally |
| Provider rate limits | Worker-side rate limiting, multiple provider accounts |
| User timezone differences | Scheduled notifications with per-user timezone |
| Template rendering at scale | Pre-compile templates, cache rendered HTML |
| Device token churn | Background job cleans invalid tokens from APNs/FCM feedback |
| Marketing to millions | Batch processing: chunk users, stagger sends over hours |
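The last row's chunk-and-stagger approach can be sketched as follows; the numbers are illustrative, the point is that the campaign is enqueued in bounded batches rather than all at once:

```python
from itertools import islice

def chunked(user_ids, size):
    """Yield lists of at most `size` user ids; each list becomes one batch enqueue job."""
    it = iter(user_ids)
    while batch := list(islice(it, size)):
        yield batch

def campaign_seconds(total_users: int, sends_per_second: int) -> float:
    """Time to drain a campaign at a sustained aggregate send rate."""
    return total_users / sends_per_second

# 10M users in 10k-user chunks at 5k sends/sec overall:
# 1000 chunks, 2000 s (~33 min) of sustained sending.
```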
15. Reliability Considerations
| Concern | Approach |
|---|---|
| At-least-once delivery | Kafka consumer ACKs after successful provider call |
| Idempotency | Dedup by idempotency key in Redis (24h TTL) |
| Provider outage | Circuit breaker per provider, failover to backup (e.g., SES -> Mailgun) |
| Poison messages | DLQ after N retries, alerting on DLQ depth |
| Data loss | Kafka replication factor 3, DB backups |
| Monitoring | Track end-to-end latency per priority level, alert on delivery rate drops |
Interview Tips
- Start with the flow. Event trigger --> preference check --> rate limit --> template --> queue --> worker --> provider. Walk through it step by step.
- Multi-channel is the point. Don't design for just push. Show you can handle push + SMS + email + in-app with a unified pipeline.
- Priority queues matter. OTP must arrive in seconds; marketing can wait hours. Separate topics/queues by priority.
- Deduplication is critical. Users receiving duplicate notifications is a real production problem. Always mention idempotency keys.
- User preferences show maturity. Opt-in/out, quiet hours, and frequency caps are what differentiate a good answer.
- Analytics is async. Track the full funnel (sent -> delivered -> opened -> clicked) without blocking delivery.
Common Interview Questions
- "How do you ensure a user never receives the same notification twice?"
- "How do you handle provider failures (APNs down)?"
- "How would you send a marketing notification to 100M users?"
- "How do you support user-specific quiet hours across timezones?"
- "What happens if the notification service itself goes down?"
Resources
- System Design Interview (Alex Xu): Chapter 10 - Design a Notification System
- Firebase Cloud Messaging (FCM) Documentation
- Apple Push Notification Service (APNs) Documentation
- Twilio API Documentation
- SendGrid / AWS SES Documentation
- Uber Engineering: "Delivering Billions of Notifications"
- LinkedIn Engineering: "Notification Infrastructure"