30 - Design Notification System

Previous: 29 - Design Key-Value Store | Next: 31 - Design Chat System


Why This Matters in Interviews

Notification systems appear in nearly every FAANG product: order confirmations, friend requests, OTP codes, marketing campaigns. This question tests your ability to design a multi-channel delivery pipeline with reliability guarantees, user preferences, priority handling, and deduplication at scale.


1. Requirements

Functional Requirements

Feature            | Details
-------------------+----------------------------------------------------
Push notifications | iOS (APNs), Android (FCM)
SMS                | OTP, account alerts
Email              | Transactional and marketing
In-app             | Real-time feed within the product
User preferences   | Opt-in/out per channel, quiet hours, frequency caps
Templating         | Reusable templates with variable substitution
Scheduling         | Send-at-time for scheduled notifications
Analytics          | Sent, delivered, opened, clicked tracking

Non-Functional Requirements

Requirement  | Target
-------------+---------------------------------------------------------
Latency      | < 5s for critical (OTP), < 30s for standard
Throughput   | 10M+ notifications/day
Availability | 99.99% for critical notifications
Reliability  | At-least-once delivery, exactly-once appearance to user
Ordering     | Best-effort ordering per user (not strict)
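
A quick back-of-envelope check of the throughput target helps size queues and workers. The sketch below converts the daily volume into a per-second rate; the peak multiplier of 10 is an assumption for illustration, not a figure from the requirements.

```python
# Back-of-envelope sizing for the 10M notifications/day target.
SECONDS_PER_DAY = 24 * 60 * 60          # 86,400

daily_volume = 10_000_000               # from the throughput requirement
average_per_second = daily_volume / SECONDS_PER_DAY

# Assumption: traffic is bursty, so size for a peak well above the average.
PEAK_MULTIPLIER = 10                    # hypothetical, tune from real traffic
peak_per_second = average_per_second * PEAK_MULTIPLIER

print(f"average: {average_per_second:.0f}/s, peak: {peak_per_second:.0f}/s")
# average: 116/s, peak: 1157/s
```

The average is modest; it is the bursts (flash sales, campaigns) that drive the design.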

2. Notification Types and Providers

Channel      | Provider                     | Protocol      | Delivery Guarantee
-------------+------------------------------+---------------+----------------------------------------
iOS Push     | Apple APNs                   | HTTP/2        | Best-effort (device must be reachable)
Android Push | Google FCM                   | HTTP/2        | Best-effort (device must be reachable)
SMS          | Twilio, Vonage, AWS SNS      | REST API      | Carrier-dependent delivery
Email        | SendGrid, AWS SES, Mailgun   | SMTP / REST   | Delivery + bounce tracking
In-app       | Internal WebSocket / polling | WebSocket/SSE | Guaranteed (stored server-side)

3. High-Level Architecture

+------------------+
| Event Sources    |  (Order Service, Auth Service, Marketing, Cron)
+--------+---------+
         |
         | (trigger notification event)
         v
+--------+---------+
| Notification     |
| Service (API)    |
| - Validate event |
| - Enrich context |
+--------+---------+
         |
         v
+--------+---------+
| User Preference  |  (opt-in/out, quiet hours, frequency caps)
| Check            |
+--------+---------+
         |
    (filtered)
         v
+--------+---------+
| Rate Limiter     |  (per-user, per-channel limits)
+--------+---------+
         |
         v
+--------+---------+
| Template Engine  |  "Hi {{name}}, your order {{order_id}} has shipped"
+--------+---------+
         |
         v
+--------+---------+
| Priority Queue   |  (Kafka topics partitioned by priority)
| P0: critical     |  OTP, security alerts
| P1: high         |  Payment, order updates
| P2: medium       |  Social (likes, comments)
| P3: low          |  Marketing, recommendations
+--------+---------+
         |
         +----------+----------+----------+
         |          |          |          |
    +----+---+ +----+---+ +----+---+ +----+----+
    | Push   | | SMS    | | Email  | | In-App  |
    | Worker | | Worker | | Worker | | Worker  |
    +----+---+ +----+---+ +----+---+ +----+----+
         |          |          |          |
    +----+---+ +----+---+ +----+---+ +----+----+
    | APNs / | | Twilio | |SendGrid| |WebSocket|
    | FCM    | |        | | / SES  | | / Store |
    +--------+ +--------+ +--------+ +---------+

4. Detailed Notification Flow

Step 1: Event Trigger
  Order Service --> POST /api/v1/notifications
  {
    "event_type": "order_shipped",
    "user_id": "u-12345",
    "data": { "order_id": "ord-789", "tracking_url": "..." },
    "channels": ["push", "email"],        // requested channels
    "priority": "high",
    "idempotency_key": "order-shipped-ord-789"
  }

Step 2: Notification Service
  - Validate payload and event_type
  - Check idempotency_key (have we processed this before?)
  - Look up user profile: name, email, device tokens, phone
  - Look up user preferences: channels enabled, quiet hours

Step 3: Preference Filter
  User preferences for u-12345:
    push: enabled
    email: enabled, but quiet hours 22:00-08:00
    sms: disabled

  Current time: 23:15 --> email deferred to 08:00
  Result: push now, email scheduled for 08:00

Step 4: Rate Limit Check
  User u-12345 received 4 push notifications in the last hour.
  Limit: 5/hour --> ALLOW (this send uses the last slot; 0 remaining afterwards)

Step 5: Template Rendering
  Template: "order_shipped"
  "Hi {{name}}, your order {{order_id}} has shipped! Track it here: {{tracking_url}}"
  -->
  "Hi Zineddine, your order ord-789 has shipped! Track it here: https://..."

Step 6: Enqueue
  Push notification --> Kafka topic: notifications.push.high
  Email (scheduled) --> DB row with send_at: 2024-03-16T08:00:00Z

Step 7: Worker Delivery
  Push Worker picks from Kafka --> Calls APNs/FCM API --> ACK on success
  Email Scheduler (cron) picks rows where send_at <= now --> Email Worker --> SES
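
Step 5's template rendering can be sketched in a few lines. This is a minimal illustration, assuming placeholders are plain `{{name}}` tokens with no conditionals or loops; the `render` function and its error behaviour are hypothetical, not a specific template engine's API.

```python
import re

# Matches {{name}} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")

def render(template: str, variables: dict) -> str:
    """Replace each {{name}} placeholder; fail loudly on a missing variable."""
    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in variables:
            raise KeyError(f"template variable not provided: {key}")
        return str(variables[key])
    return PLACEHOLDER.sub(substitute, template)

message = render(
    "Hi {{name}}, your order {{order_id}} has shipped!",
    {"name": "Zineddine", "order_id": "ord-789"},
)
print(message)  # Hi Zineddine, your order ord-789 has shipped!
```

Raising on a missing variable is a design choice: sending "Hi {{name}}" to a user is usually worse than failing the render and alerting.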

5. Provider Integration

Push Notifications (APNs + FCM)

Push Worker:
  1. Fetch device tokens for user from Device Registry
  2. Build platform-specific payload:

APNs payload (iOS):
  {
    "aps": {
      "alert": { "title": "Order Shipped", "body": "Your order..." },
      "badge": 3,
      "sound": "default"
    },
    "order_id": "ord-789"
  }

FCM payload (Android):
  {
    "message": {
      "token": "device_token_here",
      "notification": { "title": "Order Shipped", "body": "Your order..." },
      "data": { "order_id": "ord-789" }
    }
  }

  3. Send via HTTP/2 to APNs / FCM
  4. Handle responses:
     - Success: mark delivered
     - Invalid token: remove from Device Registry
     - Rate limited: retry with backoff
     - Server error: retry with backoff

Device Token Management

+-------------------+
| Device Registry   |
+-------------------+
| user_id | VARCHAR |
| token   | VARCHAR |
| platform| ENUM    |  (ios, android, web)
| app_ver | VARCHAR |
| updated | TIMESTAMP|
+-------------------+

On app launch: register/update token
On uninstall: APNs/FCM returns "invalid token" --> delete
On logout: remove token association

Email (SES / SendGrid)

Email Worker:
  1. Render HTML template (MJML or server-side)
  2. Attach tracking pixel for open tracking
  3. Rewrite links for click tracking
  4. Send via SES/SendGrid API
  5. Process webhooks for:
     - Delivered
     - Opened (tracking pixel loaded)
     - Clicked (link redirect)
     - Bounced (hard/soft)
     - Complained (spam report) --> auto-unsubscribe user

SMS (Twilio)

SMS Worker:
  1. Format message (160 GSM-7 characters per single SMS, 70 if the text needs Unicode/UCS-2; longer messages are split into multi-part segments)
  2. Send via Twilio API
  3. Process status callbacks:
     - Queued -> Sent -> Delivered
     - Failed (invalid number, carrier rejection)
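
Since SMS is billed per segment, it helps to estimate the segment count before sending. The sketch below is a simplification: `GSM7_SUBSET` is a conservative ASCII subset of the GSM-7 alphabet chosen for illustration, so text with accented letters that real GSM-7 supports is counted as Unicode here. Multi-part messages lose a few characters per segment to the concatenation header (153 instead of 160, 67 instead of 70).

```python
import math
import string

# Assumption: a conservative ASCII subset of the GSM-7 basic alphabet.
# Real GSM-7 also covers some accented letters and an extension table.
GSM7_SUBSET = set(string.ascii_letters + string.digits + " !\"#%&'()*+,-./:;<=>?@_$\n\r")

def sms_segments(text: str) -> int:
    """Estimate how many SMS segments `text` needs."""
    if all(ch in GSM7_SUBSET for ch in text):
        single, multipart = 160, 153
        length = len(text)
    else:
        single, multipart = 70, 67
        # UCS-2 counts UTF-16 code units, so an emoji takes two.
        length = len(text.encode("utf-16-be")) // 2
    if length <= single:
        return 1
    return math.ceil(length / multipart)

print(sms_segments("Your code is 482913"))  # 1
print(sms_segments("a" * 161))              # 2
```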

6. User Preferences

Schema

notification_preferences:
+----------+----------+----------+---------+-----------+----------+
| user_id  | channel  | enabled  | quiet   | quiet     | freq_cap |
|          |          |          | start   | end       | (per hr) |
+----------+----------+----------+---------+-----------+----------+
| u-12345  | push     | true     | null    | null      | 5        |
| u-12345  | email    | true     | 22:00   | 08:00     | 3        |
| u-12345  | sms      | false    | null    | null      | 1        |
| u-12345  | in_app   | true     | null    | null      | null     |
+----------+----------+----------+---------+-----------+----------+

Fine-grained preferences (optional):
+----------+-------------------+----------+
| user_id  | notification_type | enabled  |
+----------+-------------------+----------+
| u-12345  | marketing         | false    |
| u-12345  | order_updates     | true     |
| u-12345  | social_activity   | true     |
+----------+-------------------+----------+

Preference Check Logic

function should_send(user_id, channel, notification_type, current_time):
    prefs = get_preferences(user_id, channel)

    // Channel disabled?
    if not prefs.enabled:
        return SKIP

    // Notification type disabled?
    if not type_preference(user_id, notification_type).enabled:
        return SKIP

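    // Quiet hours? (a window such as 22:00-08:00 wraps past midnight)
    if prefs.quiet_start and in_quiet_hours(current_time, prefs):
        return DEFER(prefs.quiet_end)  // next time the quiet window ends

    // Frequency cap? (a null cap means unlimited, e.g. in_app)
    if prefs.freq_cap is not null and recent_count(user_id, channel, 1_hour) >= prefs.freq_cap:
        return SKIP  // or DEFER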

    return SEND
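
The quiet-hours check is the part of this logic that is easiest to get wrong, because a window like 22:00-08:00 wraps past midnight. A minimal sketch of `in_quiet_hours` and the deferral time follows; it assumes all times are already in the user's local timezone, and `next_send_time` is a hypothetical helper name.

```python
from datetime import datetime, time, timedelta

def in_quiet_hours(now: time, start: time, end: time) -> bool:
    """True if `now` falls in [start, end), allowing the window to wrap midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end      # e.g. 22:00-08:00

def next_send_time(now: datetime, start: time, end: time) -> datetime:
    """Return `now` if sending is allowed, else the next moment quiet hours end."""
    if not in_quiet_hours(now.time(), start, end):
        return now
    candidate = datetime.combine(now.date(), end)
    if candidate <= now:                  # the window ends tomorrow
        candidate += timedelta(days=1)
    return candidate

deferred = next_send_time(datetime(2024, 3, 15, 23, 15), time(22, 0), time(8, 0))
print(deferred)  # 2024-03-16 08:00:00
```

This reproduces the earlier example: an email triggered at 23:15 is deferred to 08:00 the next morning.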

7. Priority Queue Design

Kafka Topics (partitioned by user_id for ordering):

  notifications.push.critical   --> Consumer Group: push-workers (dedicated)
  notifications.push.high       --> Consumer Group: push-workers
  notifications.push.medium     --> Consumer Group: push-workers
  notifications.push.low        --> Consumer Group: push-workers (lowest priority)

  notifications.email.critical  --> Consumer Group: email-workers
  notifications.email.high      --> Consumer Group: email-workers
  ...

Priority enforcement:
  Workers poll critical topics first, then high, then medium, then low.
  Under load, low-priority notifications are naturally delayed.
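
Priority    | Examples                                  | Max Latency | Retry Attempts
------------+-------------------------------------------+-------------+----------------
P0 Critical | OTP, security alerts, password reset      | < 5s        | 5
P1 High     | Payment confirmation, order shipped       | < 30s       | 3
P2 Medium   | Friend request, comment, like             | < 5min      | 3
P3 Low      | Marketing, weekly digest, recommendations | < 1hr       | 1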
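
The "poll critical first" rule can be illustrated without Kafka. The sketch below is an in-memory stand-in: `PriorityPoller` is a hypothetical class where each deque plays the role of one priority topic, and a real worker would poll Kafka consumers in the same order.

```python
from collections import deque
from typing import Optional

PRIORITIES = ["critical", "high", "medium", "low"]   # P0..P3

class PriorityPoller:
    """In-memory stand-in for one worker reading four priority topics."""

    def __init__(self) -> None:
        self.queues = {p: deque() for p in PRIORITIES}

    def publish(self, priority: str, notification: str) -> None:
        self.queues[priority].append(notification)

    def poll(self) -> Optional[str]:
        """Return the oldest message from the highest non-empty priority."""
        for priority in PRIORITIES:
            if self.queues[priority]:
                return self.queues[priority].popleft()
        return None

poller = PriorityPoller()
poller.publish("low", "weekly digest")
poller.publish("critical", "OTP 482913")
poller.publish("high", "order shipped")
print([poller.poll() for _ in range(3)])
# ['OTP 482913', 'order shipped', 'weekly digest']
```

Strict priority like this can starve low-priority traffic under sustained load, which is acceptable for marketing but worth calling out in an interview.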

8. Retry and Dead Letter Queue

Worker attempts delivery (example: 1 initial attempt + 3 retries):

Attempt 1: FAIL (provider timeout)
  --> wait 1s (base * 2^0)
Attempt 2: FAIL (provider 500)
  --> wait 2s (base * 2^1)
Attempt 3: FAIL (provider 500)
  --> wait 4s (base * 2^2)
Attempt 4: FAIL (retry budget exhausted)
  --> Move to Dead Letter Queue (DLQ)

DLQ Processing:
  - Manual inspection dashboard
  - Automated retry after provider recovery
  - Alert if DLQ depth exceeds threshold
  - Discard after 24 hours (notification is stale)

+----------+     +----------+     +----------+
| Main     | --> | Retry    | --> | Dead     |
| Queue    |     | Queue    |     | Letter   |
|          |     | (backoff)|     | Queue    |
+----------+     +----------+     +----------+
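
The backoff schedule and the hand-off to the DLQ can be sketched as follows. This assumes "N retries" means one initial attempt plus N retries; `deliver_with_retry` and its parameters are hypothetical names, and the `sleep` function is injected so the example runs instantly.

```python
from typing import Callable, List

def backoff_delays(max_retries: int, base: float = 1.0) -> List[float]:
    """Delay before each retry: base * 2^0, base * 2^1, ..."""
    return [base * (2 ** attempt) for attempt in range(max_retries)]

def deliver_with_retry(send: Callable[[], bool], max_retries: int,
                       dead_letters: list, message: str,
                       sleep: Callable[[float], None]) -> bool:
    """Try once, then retry with exponential backoff; park failures in the DLQ."""
    if send():
        return True
    for delay in backoff_delays(max_retries):
        sleep(delay)
        if send():
            return True
    dead_letters.append(message)
    return False

attempts = []
dlq: list = []

def always_fail() -> bool:
    attempts.append(1)
    return False

ok = deliver_with_retry(always_fail, 3, dlq, "push:u-12345", sleep=lambda s: None)
print(backoff_delays(3))        # [1.0, 2.0, 4.0]
print(ok, len(attempts), dlq)   # False 4 ['push:u-12345']
```

Production systems usually add random jitter to each delay so that many workers retrying the same failed provider do not hit it in lockstep.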

9. Deduplication

Why Dedup Matters

Without dedup:
  Order Service retries event --> Notification Service processes twice
  --> User gets "Your order shipped!" twice

This is annoying and erodes trust.

Implementation

Idempotency key: "order-shipped-ord-789"

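On receive:
  1. Atomically claim the idempotency_key in Redis/DB (check and store in one step,
     so two concurrent retries cannot both pass the check):
     - Key already exists --> Return 200 OK (already handled), skip processing
     - Key was just set   --> Process notification; the key expires after its TTL (24h)

Redis command:
  SET idempotency:order-shipped-ord-789 1 EX 86400 NX
  (NX = set only if the key does not exist; the reply is OK on success and nil if the key was already set)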
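
The claim semantics can be tried out with an in-memory stand-in. `IdempotencyStore` below is a hypothetical class that mimics `SET key 1 EX ttl NX` for a single process; with the redis-py client the equivalent call would be `client.set(key, 1, ex=86400, nx=True)`, which returns `True` on success and `None` when the key already exists.

```python
import time
from typing import Callable, Dict

class IdempotencyStore:
    """In-memory stand-in for `SET key 1 EX ttl NX` (single process only)."""

    def __init__(self, ttl_seconds: float = 86400,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.expires_at: Dict[str, float] = {}

    def claim(self, key: str) -> bool:
        """Return True for the first caller within the TTL, False for duplicates."""
        now = self.clock()
        expiry = self.expires_at.get(key)
        if expiry is not None and expiry > now:
            return False                  # duplicate: already claimed
        self.expires_at[key] = now + self.ttl
        return True

store = IdempotencyStore()
print(store.claim("order-shipped-ord-789"))  # True  (process it)
print(store.claim("order-shipped-ord-789"))  # False (skip the retry)
```

Unlike Redis, this sketch is not safe across processes; the point of using Redis is that the claim is atomic for every instance of the Notification Service.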

10. Scheduled Notifications

+-------------------+
| Scheduled Store   |
| (Database)        |
+-------------------+
| notification_id   |
| user_id           |
| channel           |
| payload           |
| send_at           |  <-- future timestamp
| status            |  (pending, sent, cancelled)
| created_at        |
+-------------------+

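Scheduler Worker (runs every minute, inside one transaction):
  SELECT * FROM scheduled_notifications
  WHERE send_at <= NOW()
    AND status = 'pending'
  ORDER BY send_at
  LIMIT 1000
  FOR UPDATE SKIP LOCKED

  For each:
    1. Publish to notification queue
    2. Update status = 'sent'
  Commit.
  (FOR UPDATE SKIP LOCKED lets several scheduler instances run in parallel: each one skips rows another instance has already locked, which prevents double-processing)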

Use Cases

  • "Your trial ends in 3 days" (for a 14-day trial: scheduled at trial_start + 11 days)
  • Weekly digest emails (every Monday 9 AM in user's timezone)
  • Quiet hours deferral (email deferred from 23:15 to 08:00)
  • Marketing campaign (send to 10M users starting at 10 AM EST)
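
For the timezone-dependent cases, send_at is typically computed in the user's local time and stored as UTC. A minimal sketch for the weekly digest, using the standard-library `zoneinfo` module; `next_weekly_send` is a hypothetical helper, and the user's IANA timezone name is assumed to be stored on the profile.

```python
from datetime import datetime, time, timedelta, timezone
from zoneinfo import ZoneInfo

def next_weekly_send(now_utc: datetime, tz_name: str,
                     weekday: int = 0, at: time = time(9, 0)) -> datetime:
    """Next occurrence of `weekday` (0 = Monday) at local time `at`, as UTC."""
    tz = ZoneInfo(tz_name)
    local_now = now_utc.astimezone(tz)
    days_ahead = (weekday - local_now.weekday()) % 7
    candidate = datetime.combine(local_now.date() + timedelta(days=days_ahead), at, tzinfo=tz)
    if candidate <= local_now:            # this week's slot has already passed
        candidate = datetime.combine(candidate.date() + timedelta(days=7), at, tzinfo=tz)
    return candidate.astimezone(timezone.utc)

send_at = next_weekly_send(datetime(2024, 3, 13, 12, 0, tzinfo=timezone.utc), "America/New_York")
print(send_at.isoformat())  # 2024-03-18T13:00:00+00:00
```

Building the local wall-clock time first and converting to UTC last keeps "9 AM" stable across daylight-saving changes.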

11. Analytics Pipeline

Notification Events:
  created --> queued --> sent --> delivered --> opened --> clicked
                          |
                          +--> failed --> retried --> sent --> ...
                          +--> failed --> DLQ

+------------------+     +----------+     +------------------+
| Notification     | --> | Kafka    | --> | Analytics        |
| Workers          |     | Events   |     | Consumer         |
| (emit events)    |     | Topic    |     | (aggregate)      |
+------------------+     +----------+     +--------+---------+
                                                   |
                                          +--------+---------+
                                          | Analytics DB     |
                                          | (ClickHouse)     |
                                          +--------+---------+
                                                   |
                                          +--------+---------+
                                          | Dashboard        |
                                          | - Delivery rate  |
                                          | - Open rate      |
                                          | - Click rate     |
                                          | - Bounce rate    |
                                          +------------------+

Key Metrics

Metric           | Formula                  | Healthy Target
-----------------+--------------------------+-------------------------------
Delivery rate    | delivered / sent         | > 98% (push), > 95% (email)
Open rate        | opened / delivered       | 15-25% (email), 40-60% (push)
Click rate       | clicked / opened         | 2-5% (email)
Bounce rate      | bounced / sent           | < 2% (email)
Unsubscribe rate | unsubscribed / delivered | < 0.5%
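
The formulas in the table translate directly into code. The sketch below computes them from raw event counts; `funnel_rates` is a hypothetical helper and the campaign numbers are made up for illustration.

```python
from typing import Dict

def funnel_rates(counts: Dict[str, int]) -> Dict[str, float]:
    """Compute the funnel ratios from raw event counts, as fractions."""
    def ratio(numerator: str, denominator: str) -> float:
        total = counts.get(denominator, 0)
        return counts.get(numerator, 0) / total if total else 0.0
    return {
        "delivery_rate": ratio("delivered", "sent"),
        "open_rate": ratio("opened", "delivered"),
        "click_rate": ratio("clicked", "opened"),
        "bounce_rate": ratio("bounced", "sent"),
        "unsubscribe_rate": ratio("unsubscribed", "delivered"),
    }

# Hypothetical email campaign counts, for illustration only.
rates = funnel_rates({"sent": 10_000, "delivered": 9_600, "opened": 1_920,
                      "clicked": 96, "bounced": 150, "unsubscribed": 24})
print({name: f"{value:.2%}" for name, value in rates.items()})
```

In the real pipeline these aggregations would run in the analytics store rather than in application code; the guard against a zero denominator matters either way.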

12. Rate Limiting for Notifications

Per-user limits:
  Push: 5/hour, 20/day
  Email: 3/hour, 10/day
  SMS: 2/hour, 5/day (SMS is expensive)

Per-channel global limits:
  APNs: stay under provider rate limits
  Twilio: account-level TPS limit

Implementation:
  Same sliding window counter as 28 - Design Rate Limiter
  Key: rate_limit:notif:{user_id}:{channel}:{window}
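
As a rough illustration of the per-user limit, here is a sliding-window log, a simpler relative of the sliding window counter that stores one timestamp per event. `SlidingWindowLimiter` is a hypothetical in-memory class; a shared deployment would keep this state in Redis under the key format shown above.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict

class SlidingWindowLimiter:
    """Allow at most `limit` events per key within the last `window` seconds."""

    def __init__(self, limit: int, window: float,
                 clock: Callable[[], float] = time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.events: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, key: str) -> bool:
        now = self.clock()
        timestamps = self.events[key]
        while timestamps and timestamps[0] <= now - self.window:
            timestamps.popleft()          # drop events outside the window
        if len(timestamps) >= self.limit:
            return False
        timestamps.append(now)
        return True

push_limiter = SlidingWindowLimiter(limit=5, window=3600)
results = [push_limiter.allow("rate_limit:notif:u-12345:push") for _ in range(6)]
print(results)  # [True, True, True, True, True, False]
```

The hourly and daily limits can be enforced by running two limiters per channel and requiring both to allow the send.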

13. Full System Diagram

+----------------+  +----------------+  +------------------+
| Order Service  |  | Auth Service   |  | Marketing Engine |
+-------+--------+  +-------+--------+  +--------+---------+
        |                    |                    |
        +--------------------+--------------------+
                             |
                    +--------+--------+
                    | Notification    |
                    | Service (API)   |
                    | - Auth/validate |
                    | - Idempotency   |
                    +--------+--------+
                             |
              +--------------+--------------+
              |              |              |
     +--------+-----+ +-----+------+ +-----+------+
     | User Prefs   | | Rate       | | Template   |
     | Service      | | Limiter    | | Engine     |
     | (filter)     | | (throttle) | | (render)   |
     +--------------+ +------------+ +-----+------+
                                            |
                                   +--------+--------+
                                   | Priority Queues |
                                   | (Kafka)         |
                                   +--------+--------+
                                            |
              +-----------+--------+--------+-------+
              |           |        |                |
       +------+---+ +-----+--+ +---+----+ +---------+--+
       | Push     | | SMS    | | Email  | | In-App     |
       | Workers  | | Workers| | Workers| | Workers    |
       +------+---+ +-----+--+ +---+----+ +---------+--+
              |           |        |                |
       +------+---+ +-----+--+ +---+----+ +---------+--+
       | APNs/FCM | | Twilio | | SES/SG | | WebSocket  |
       +----------+ +--------+ +--------+ | + DB Store |
                                          +------------+
                             |
                    +--------+--------+
                    | DLQ + Retry     |
                    | Workers         |
                    +-----------------+
                             |
                    +--------+--------+
                    | Analytics       |
                    | (ClickHouse)    |
                    +-----------------+
                             |
                    +--------+--------+
                    | Scheduled       |
                    | Notification DB |
                    | + Cron Worker   |
                    +-----------------+

14. Handling Scale: 10M+ Notifications/Day

Challenge                   | Solution
----------------------------+-------------------------------------------------------------
Burst traffic (flash sales) | Kafka absorbs spikes, workers scale horizontally
Provider rate limits        | Worker-side rate limiting, multiple provider accounts
User timezone differences   | Scheduled notifications with per-user timezone
Template rendering at scale | Pre-compile templates, cache rendered HTML
Device token churn          | Background job cleans invalid tokens from APNs/FCM feedback
Marketing to millions       | Batch processing: chunk users, stagger sends over hours
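
The "chunk users, stagger sends" approach for large campaigns can be sketched as a generator that assigns each batch its own send time. `staggered_batches` is a hypothetical helper; in the full design each yielded batch would become rows in the scheduled store or messages on the low-priority topic.

```python
from datetime import datetime, timedelta
from typing import Iterator, List, Sequence, Tuple

def staggered_batches(user_ids: Sequence[str], batch_size: int,
                      start: datetime, spread: timedelta
                      ) -> Iterator[Tuple[datetime, List[str]]]:
    """Yield (send_at, batch) pairs with batches spaced evenly across `spread`."""
    if not user_ids:
        return
    batch_count = -(-len(user_ids) // batch_size)     # ceiling division
    step = spread / batch_count
    for index in range(batch_count):
        batch = list(user_ids[index * batch_size:(index + 1) * batch_size])
        yield start + index * step, batch

users = [f"u-{n}" for n in range(10)]
plan = list(staggered_batches(users, batch_size=4,
                              start=datetime(2024, 3, 15, 10, 0),
                              spread=timedelta(hours=3)))
for send_at, batch in plan:
    print(send_at.time(), batch)
```

With 10 users, a batch size of 4, and a 3-hour spread, this yields three batches at 10:00, 11:00, and 12:00.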

15. Reliability Considerations

Concern                | Approach
-----------------------+-------------------------------------------------------------------------
At-least-once delivery | Kafka consumer ACKs after successful provider call
Idempotency            | Dedup by idempotency key in Redis (24h TTL)
Provider outage        | Circuit breaker per provider, failover to backup (e.g., SES -> Mailgun)
Poison messages        | DLQ after N retries, alerting on DLQ depth
Data loss              | Kafka replication factor 3, DB backups
Monitoring             | Track end-to-end latency per priority level, alert on delivery rate drops
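
The circuit breaker with failover can be illustrated with a small state machine. This is a sketch under stated assumptions: `CircuitBreaker` and `send_email` are hypothetical names, the thresholds are arbitrary, and the provider calls are plain callables standing in for real SES or Mailgun clients.

```python
import time
from typing import Callable, Optional

class CircuitBreaker:
    """Open after `threshold` consecutive failures; allow a trial call after `cooldown`."""

    def __init__(self, threshold: int = 3, cooldown: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        return self.clock() - self.opened_at >= self.cooldown   # half-open trial

    def record(self, success: bool) -> None:
        if success:
            self.failures, self.opened_at = 0, None
            return
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()

def send_email(primary: Callable[[], bool], backup: Callable[[], bool],
               breaker: CircuitBreaker) -> str:
    """Use the primary provider unless its breaker is open, then fail over."""
    if breaker.allow_request():
        ok = primary()
        breaker.record(ok)
        if ok:
            return "primary"
    return "backup" if backup() else "failed"

breaker = CircuitBreaker(threshold=2, cooldown=60)
used = [send_email(lambda: False, lambda: True, breaker) for _ in range(3)]
print(used, breaker.opened_at is not None)  # ['backup', 'backup', 'backup'] True
```

After two consecutive failures the breaker opens, so the third send skips the primary entirely instead of waiting on a timeout.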

Interview Tips

  1. Start with the flow. Event trigger --> preference check --> rate limit --> template --> queue --> worker --> provider. Walk through it step by step.
  2. Multi-channel is the point. Don't design for just push. Show you can handle push + SMS + email + in-app with a unified pipeline.
  3. Priority queues matter. OTP must arrive in seconds; marketing can wait hours. Separate topics/queues by priority.
  4. Deduplication is critical. Users receiving duplicate notifications is a real production problem. Always mention idempotency keys.
  5. User preferences show maturity. Opt-in/out, quiet hours, and frequency caps are what differentiate a good answer.
  6. Analytics is async. Track the full funnel (sent -> delivered -> opened -> clicked) without blocking delivery.

Common Interview Questions

  • "How do you ensure a user never receives the same notification twice?"
  • "How do you handle provider failures (APNs down)?"
  • "How would you send a marketing notification to 100M users?"
  • "How do you support user-specific quiet hours across timezones?"
  • "What happens if the notification service itself goes down?"

Resources

  • System Design Interview (Alex Xu): Chapter 10 - Design a Notification System
  • Firebase Cloud Messaging (FCM) Documentation
  • Apple Push Notification Service (APNs) Documentation
  • Twilio API Documentation
  • SendGrid / AWS SES Documentation
  • Uber Engineering: "Delivering Billions of Notifications"
  • LinkedIn Engineering: "Notification Infrastructure"
