🔔 Design Notification System — System Design Interview Guide

Medium · Messaging & Fan-Out

Design a notification system that delivers messages via push (iOS/Android), email, and SMS to 100M+ users with reliable delivery, deduplication, and user preference management.

Functional requirements

Send push notifications (iOS APNs, Android FCM)
Send email notifications (via SendGrid/SES)
Send SMS notifications (via Twilio)
User preferences: enable/disable per channel and per type
Retry failed deliveries with exponential backoff
Delivery receipts and analytics dashboard

Non-functional requirements & scale

Send 10M notifications per day across all channels
Delivery latency: push < 1 sec, email < 5 sec, SMS < 30 sec
At-least-once delivery guarantee with deduplication
No notification sent to users who opted out
System must handle third-party API failures gracefully
Scale horizontally: add workers to handle spikes
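
At-least-once delivery means a retry or a consumer restart can hand the same notification to a worker twice, so a worker needs an idempotency check before it calls a provider. A minimal sketch, assuming notifId is the deduplication key and using an in-memory dict as a stand-in for a shared store; the class name, the 24-hour window, and the injected clock are illustrative choices, not part of the requirements above.

```python
import time
from typing import Callable, Dict


class Deduplicator:
    """Remembers notifIds seen within a time window (in-memory stand-in)."""

    def __init__(self, window_s: float = 24 * 3600,
                 clock: Callable[[], float] = time.time) -> None:
        self._window_s = window_s
        self._clock = clock
        self._seen: Dict[str, float] = {}  # notif_id -> expiry time

    def first_time(self, notif_id: str) -> bool:
        """Return True only the first time notif_id is seen in the window."""
        now = self._clock()
        expiry = self._seen.get(notif_id)
        if expiry is not None and expiry > now:
            return False  # duplicate delivery: skip the provider call
        self._seen[notif_id] = now + self._window_s
        return True
```

In a multi-worker deployment the same check would have to be atomic in a shared store; with Redis, a SET with the NX and EX options is one common way to get that, though this sketch does not depend on it.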

Capacity estimation

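10M notifications/day across push, email, and SMS averages about 116 notifications/sec (10,000,000 / 86,400 s), so steady-state load is modest. The channels differ in throughput: push is fast and cheap, email is moderate and deliverability matters, SMS is slow and expensive. The sizing driver is fan-out: a single event (a new post by a celebrity) can trigger 10M push notifications, a full day's average volume in one burst, so queues and workers must be sized for the peak rather than the mean.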
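
The average and burst load implied by these numbers can be worked out directly. A small sketch; the three send rates in the loop are hypothetical values chosen only to show how drain time scales, not measured provider throughput.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate(per_day: int) -> float:
    """Average notifications per second for a given daily volume."""
    return per_day / SECONDS_PER_DAY


def drain_seconds(burst: int, sends_per_sec: float) -> float:
    """Time to work through a burst at a sustained send rate."""
    return burst / sends_per_sec


daily_volume = 10_000_000   # from the requirements
celebrity_burst = 10_000_000  # one fan-out event

print(f"average load: {average_rate(daily_volume):.0f}/sec")
for rate in (1_000, 10_000, 100_000):  # hypothetical sustained send rates
    minutes = drain_seconds(celebrity_burst, rate) / 60
    print(f"burst at {rate:>7}/sec drains in {minutes:.1f} min")
```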

Core entities

Notification — notifId, type, channel, userId, payload, status, createdAt, sentAt
UserDevice — deviceId, userId, platform (ios|android), pushToken, isActive
UserPreference — userId, channel, notifType, isEnabled
DeliveryLog — notifId, attempt, status, error?, timestamp
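
The entities above can be written down as plain data types. This is a sketch, not a schema: field names are the ones listed above converted to snake_case, while the enum classes, the str type for status, and the choice of which fields are optional are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class Channel(str, Enum):
    PUSH = "push"
    EMAIL = "email"
    SMS = "sms"


class Platform(str, Enum):
    IOS = "ios"
    ANDROID = "android"


@dataclass
class Notification:
    notif_id: str
    type: str
    channel: Channel
    user_id: str
    payload: dict
    status: str
    created_at: datetime
    sent_at: Optional[datetime] = None  # unset until a send succeeds


@dataclass
class UserDevice:
    device_id: str
    user_id: str
    platform: Platform
    push_token: str
    is_active: bool = True


@dataclass
class UserPreference:
    user_id: str
    channel: Channel
    notif_type: str
    is_enabled: bool


@dataclass
class DeliveryLog:
    notif_id: str
    attempt: int
    status: str
    timestamp: datetime
    error: Optional[str] = None  # "error?" above: present only on failure
```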

API design

POST /api/v1/notifications/send — Internal API. Body: { userId, type, payload, channels[] }. Enqueues to Kafka.
PUT /api/v1/users/me/preferences — Update notification preferences.
POST /api/v1/devices/register — Register push token. Body: { platform, pushToken }.
GET /api/v1/notifications?cursor= — In-app notification history.
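
As an illustration of what the send endpoint might do before enqueueing, the sketch below validates the request body and expands it into one record per channel. The function name and the topic names ("notifications.push" and so on) are hypothetical, and the Kafka producer call itself is left out.

```python
from typing import Any, Dict, List, Tuple

VALID_CHANNELS = {"push", "email", "sms"}
REQUIRED_FIELDS = ("userId", "type", "payload", "channels")


def plan_send(body: Dict[str, Any]) -> List[Tuple[str, Dict[str, Any]]]:
    """Validate a send request and return (topic, record) pairs to enqueue."""
    missing = [f for f in REQUIRED_FIELDS if f not in body]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")

    channels = body["channels"]
    if (not isinstance(channels, list) or not channels
            or not all(isinstance(c, str) for c in channels)):
        raise ValueError("channels must be a non-empty list of strings")

    unknown = sorted(set(channels) - VALID_CHANNELS)
    if unknown:
        raise ValueError(f"unknown channels: {', '.join(unknown)}")

    records = []
    for channel in dict.fromkeys(channels):  # drop repeats, keep order
        records.append((
            f"notifications.{channel}",  # hypothetical topic naming scheme
            {
                "userId": body["userId"],
                "type": body["type"],
                "payload": body["payload"],
                "channel": channel,
            },
        ))
    return records
```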

High-level design

Notification trigger → Kafka topic per channel (push/email/sms) → channel workers read queue → check user preferences → call third-party API (APNs/FCM/SendGrid/Twilio) → log delivery result → retry on failure.
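
The per-message steps a channel worker performs can be sketched as one function. The provider call and the preference lookup are passed in as plain callables so the flow is visible without any real client; the outcome strings other than OPTED_OUT, and the message field names, are assumptions.

```python
from typing import Callable, Dict, List, Tuple


def handle_message(
    message: Dict[str, str],
    is_enabled: Callable[[str, str, str], bool],
    send: Callable[[Dict[str, str]], int],
    delivery_log: List[Tuple[str, str]],
) -> str:
    """Check preferences, call the provider, log the result.

    `send` stands in for a third-party API call and returns an
    HTTP-style status code.
    """
    if not is_enabled(message["userId"], message["channel"], message["type"]):
        outcome = "OPTED_OUT"  # never reaches the provider
    else:
        status = send(message)
        if 200 <= status < 300:
            outcome = "SENT"
        elif status == 429 or status >= 500:
            outcome = "RETRY"  # transient: hand off to the retry path
        else:
            outcome = "FAILED"  # permanent: retrying will not help
    delivery_log.append((message["notifId"], outcome))
    return outcome
```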

Deep dives

🔄 Retry with Backoff

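Third-party APIs fail transiently. Strategy: Kafka consumer → call API → on 429/5xx, republish the message to a separate retry topic with an incremented attempt count and a not-before time derived from the message timestamp, then commit the offset. Kafka has no per-message ack or built-in delayed redelivery, and a consumer that blocks on one failing message holds up every message behind it in that partition. A retry consumer waits until the not-before time before trying again; the delay doubles on each attempt (1s, 2s, 4s, 8s, 16s). After 5 retries: send to a Dead Letter Queue (DLQ) for manual inspection. APNs: if the token is invalid → mark the device inactive and stop retrying.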
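
The routing decision after a failed send can be sketched as a pure function: given the attempt count and the provider's status code, pick the retry topic, the DLQ, or neither. The Envelope type, the destination labels, and the 1-second base delay are assumptions for illustration; publishing to Kafka is left out.

```python
from dataclasses import dataclass, replace
from typing import Tuple

MAX_RETRIES = 5
BASE_DELAY_S = 1.0


@dataclass(frozen=True)
class Envelope:
    notif_id: str
    attempt: int = 0         # retries already scheduled
    not_before: float = 0.0  # earliest time the retry consumer may send


def backoff_delay(attempt: int) -> float:
    """Exponential backoff: 1s, 2s, 4s, 8s, 16s for attempts 0..4."""
    return BASE_DELAY_S * (2 ** attempt)


def is_retryable(status: int) -> bool:
    return status == 429 or 500 <= status <= 599


def route_after_failure(env: Envelope, status: int,
                        now: float) -> Tuple[str, Envelope]:
    """Return (destination, envelope) where destination is
    "retry", "dlq", or "drop"."""
    if not is_retryable(status):
        return "drop", env  # e.g. invalid token: retrying cannot succeed
    if env.attempt >= MAX_RETRIES:
        return "dlq", env   # retries exhausted: park for inspection
    delay = backoff_delay(env.attempt)
    return "retry", replace(env, attempt=env.attempt + 1,
                            not_before=now + delay)
```

Real deployments usually add random jitter to the delay so that many failed messages do not all retry at the same instant; it is omitted here to keep the schedule easy to read.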

🚫 Preference Enforcement

Before sending: check the UserPreference table. Cache preferences in Redis per userId (TTL 1h), and delete or overwrite the cache entry whenever preferences change so that an opt-out takes effect immediately rather than up to an hour later. If opted out: drop the message and record OPTED_OUT in the delivery log. GDPR right to erasure: delete all user data, including preferences and device tokens. Global unsubscribe: one click disables all channels for that user.
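
A read-through cache with a TTL can be sketched as below, with a dict standing in for Redis and a callable standing in for the UserPreference query. The invalidate method is an assumption added here so that an opt-out does not have to wait for the TTL, and treating a missing preference row as enabled is a policy assumption as well.

```python
import time
from typing import Callable, Dict, Tuple

PrefKey = Tuple[str, str]    # (channel, notif_type)
Prefs = Dict[PrefKey, bool]  # is_enabled per key


class PreferenceCache:
    """Read-through preference cache keyed by user_id (in-memory stand-in)."""

    def __init__(self, load_from_db: Callable[[str], Prefs],
                 ttl_s: float = 3600.0,
                 clock: Callable[[], float] = time.time) -> None:
        self._load = load_from_db
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Prefs]] = {}

    def is_enabled(self, user_id: str, channel: str, notif_type: str) -> bool:
        now = self._clock()
        entry = self._entries.get(user_id)
        if entry is None or entry[0] <= now:  # miss or expired: reload
            entry = (now + self._ttl_s, self._load(user_id))
            self._entries[user_id] = entry
        # Assumption: no stored row means the user has not opted out.
        return entry[1].get((channel, notif_type), True)

    def invalidate(self, user_id: str) -> None:
        """Call after a preference update or an erasure request."""
        self._entries.pop(user_id, None)
```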

📱 Push Token Management

Push tokens expire or change when the user reinstalls the app. On app open: always register the current token. APNs returns 410 (reason Unregistered) for tokens that are no longer active → mark the device inactive. FCM signals the same condition through an error code (UNREGISTERED in the HTTP v1 API, NotRegistered or InvalidRegistration in the legacy API) → same. Batch-check stale tokens periodically. Each user may have multiple devices, so send to all active devices.
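
The token lifecycle can be sketched as a small registry: register on app open, deactivate on an invalid-token response, and fan out to every active device. Keying by push token, the class and method names, and the provider codes listed in INVALID_TOKEN_SIGNALS are assumptions; the codes in particular should be checked against current APNs and FCM documentation.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Device:
    user_id: str
    platform: str  # "ios" or "android"
    push_token: str
    is_active: bool = True


# Assumed (platform, provider code) pairs that mean "token is dead".
INVALID_TOKEN_SIGNALS = {("ios", "410"), ("android", "UNREGISTERED")}


def is_invalid_token(platform: str, provider_code: str) -> bool:
    return (platform, provider_code) in INVALID_TOKEN_SIGNALS


class DeviceRegistry:
    """In-memory stand-in for the UserDevice table, keyed by push token."""

    def __init__(self) -> None:
        self._by_token: Dict[str, Device] = {}

    def register(self, user_id: str, platform: str, push_token: str) -> None:
        # Called on every app open; re-registering reactivates the token.
        self._by_token[push_token] = Device(user_id, platform, push_token)

    def mark_inactive(self, push_token: str) -> None:
        device = self._by_token.get(push_token)
        if device is not None:
            device.is_active = False

    def active_tokens(self, user_id: str) -> List[str]:
        return [d.push_token for d in self._by_token.values()
                if d.user_id == user_id and d.is_active]
```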

⚡ Fan-Out at Scale

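A celebrity post triggers 10M notifications. Naive: the posting service itself emits one Kafka message per follower, 10M messages at once on the request path. Better: publish a single "fan-out" event; a fan-out worker reads the follower list in pages and enqueues individual notifications in batches of 1000. Rate-limit per channel to what each provider sustains (for example, a 300K/sec burst budget for APNs; treat that figure as an assumption to verify against the provider). Use multiple Kafka partitions per channel (push: 100 partitions allows up to 100 parallel workers in one consumer group).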
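
Two pieces of the fan-out worker lend themselves to a short sketch: splitting a follower stream into batches of 1000, and a token bucket that caps the send rate per channel. Function and class names are hypothetical, and the clock is injected only so the behaviour is easy to check.

```python
import time
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def in_batches(ids: Iterable[str], size: int = 1000) -> Iterator[List[str]]:
    """Yield lists of at most `size` ids without loading the whole stream."""
    it = iter(ids)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate_per_s` tokens/sec."""

    def __init__(self, rate_per_s: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate_per_s
        self._capacity = capacity
        self._clock = clock
        self._tokens = capacity
        self._last = clock()

    def try_acquire(self, n: int = 1) -> bool:
        """Take n tokens if available; the caller waits or requeues if not."""
        now = self._clock()
        refill = (now - self._last) * self._rate
        self._tokens = min(self._capacity, self._tokens + refill)
        self._last = now
        if n <= self._tokens:
            self._tokens -= n
            return True
        return False
```

A fan-out worker would call try_acquire(len(batch)) before enqueueing each batch and pause briefly when it returns False, which keeps the burst from exceeding the provider budget.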

Scaling considerations

Separate Kafka topics per channel: push/email/sms — independent scaling
Worker fleet auto-scales with Kafka consumer lag metrics
Redis for preference cache reduces DB load on every notification
DLQ monitoring with alerting for high failure rates per provider
Cost optimization: batch email sends in workers (SES is priced per 1,000 emails)
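
Scaling on consumer lag comes down to a small calculation: how many workers are needed to drain the current backlog within a target time. A sketch under stated assumptions; the function name, the per-worker rate, and the drain target are hypothetical inputs, and the cap reflects that consumers beyond the partition count in one consumer group sit idle.

```python
import math


def desired_workers(consumer_lag: int, per_worker_rate: float,
                    target_drain_s: float, partitions: int,
                    min_workers: int = 1) -> int:
    """Workers needed to clear `consumer_lag` messages in `target_drain_s`.

    Capped at the partition count because extra consumers in the same
    group would receive no partitions.
    """
    if per_worker_rate <= 0 or target_drain_s <= 0:
        raise ValueError("rate and drain target must be positive")
    needed = math.ceil(consumer_lag / (per_worker_rate * target_drain_s))
    return max(min_workers, min(needed, partitions))
```

For example, a backlog of 10M messages with workers that each send 2,000/sec and a 60-second drain target asks for 84 workers, which fits within 100 partitions.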

What interviewers expect by level

Junior: Describe the notification flow through Kafka to third-party APIs. Know push vs email vs SMS differences.
Mid: Retry with exponential backoff, preference enforcement, DLQ, push token lifecycle.
Senior: Fan-out architecture for celebrity events, multi-provider failover, batch optimization, rate limiting per provider.
Staff: Global notification routing (US/EU APNs endpoints), cost per notification optimization, GDPR compliance pipeline.
