🔔 Design Notification System — System Design Interview Guide
Medium · Messaging & Fan-Out
Design a notification system that delivers messages via push (iOS/Android), email, and SMS to 100M+ users with reliable delivery, deduplication, and user preference management.
Functional requirements
- Send push notifications (iOS APNs, Android FCM)
- Send email notifications (via SendGrid/SES)
- Send SMS notifications (via Twilio)
- User preferences: enable/disable per channel and per type
- Retry failed deliveries with exponential backoff
- Delivery receipts and analytics dashboard
Non-functional requirements & scale
- Send 10M notifications per day across all channels
- Delivery latency: push < 1 sec, email < 5 sec, SMS < 30 sec
- At-least-once delivery guarantee with deduplication
- No notification sent to users who opted out
- System must handle third-party API failures gracefully
- Scale horizontally: add workers to handle spikes
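At-least-once delivery means a worker can see the same notification twice (for example after a retry or a consumer restart), so deduplication needs an idempotency check before the provider call. A minimal sketch of that check, assuming a key built from the notification ID and channel; `DedupStore`, `claim`, and `dedup_key` are hypothetical names, and the in-memory dict stands in for a shared store such as Redis:

```python
import time
from typing import Dict, Optional


class DedupStore:
    """In-memory stand-in for a shared store (e.g. a Redis key set with a TTL)."""

    def __init__(self, ttl_seconds: float = 24 * 3600) -> None:
        self._ttl = ttl_seconds
        self._seen: Dict[str, float] = {}  # key -> expiry timestamp

    def claim(self, key: str, now: Optional[float] = None) -> bool:
        """Return True the first time a key is claimed within the TTL window."""
        now = time.time() if now is None else now
        expiry = self._seen.get(key)
        if expiry is not None and expiry > now:
            return False  # duplicate: this key was already claimed
        self._seen[key] = now + self._ttl
        return True


def dedup_key(notif_id: str, channel: str) -> str:
    # Assumed key format: at most one delivery per notification per channel.
    return f"{notif_id}:{channel}"
```

A worker would call `claim(dedup_key(...))` and skip the send when it returns False. Claiming before the send risks a missed delivery if the worker crashes in between; claiming after the send risks a duplicate instead.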
Capacity estimation
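10M notifications/day across push, email, and SMS averages only about 116 notifications/sec, so steady-state load is modest. Each channel has different throughput characteristics: push is fast and cheap, email is moderate and depends on deliverability, and SMS is slow and expensive. Peak load is driven by fan-out: one event (a new post by a celebrity) can trigger 10M push notifications, as much as a full day of normal traffic at once.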
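The arithmetic behind these numbers can be checked in a few lines. This is a back-of-envelope sketch, and it assumes the 300K/sec push budget quoted under Fan-Out at Scale is the limiting factor during a fan-out:

```python
SECONDS_PER_DAY = 24 * 60 * 60

daily_notifications = 10_000_000  # from the non-functional requirements
fanout_event_size = 10_000_000    # one celebrity post
push_rate_limit = 300_000         # per-second push budget (assumed bottleneck)

# Steady-state load if traffic were spread evenly over the day.
average_rate = daily_notifications / SECONDS_PER_DAY

# Time to drain one fan-out event when sending at the full rate limit.
fanout_drain_seconds = fanout_event_size / push_rate_limit

print(f"average rate: {average_rate:.0f}/sec")                        # average rate: 116/sec
print(f"fan-out drain at rate limit: {fanout_drain_seconds:.1f} sec")  # 33.3 sec
```

Under these assumptions the last recipient of a 10M fan-out waits roughly half a minute, so the sub-second push target is realistic for individual notifications but not for the tail of a large fan-out.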
Core entities
- Notification — notifId, type, channel, userId, payload, status, createdAt, sentAt
- UserDevice — deviceId, userId, platform (ios|android), pushToken, isActive
- UserPreference — userId, channel, notifType, isEnabled
- DeliveryLog — notifId, attempt, status, error?, timestamp
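The entities above could be modeled as plain data classes. This is a sketch only: the snake_case field names, the `Channel` enum, and the default values are assumptions made for illustration, not a schema given by the guide.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class Channel(Enum):
    PUSH = "push"
    EMAIL = "email"
    SMS = "sms"


@dataclass
class Notification:
    notif_id: str
    type: str
    channel: Channel
    user_id: str
    payload: dict
    status: str
    created_at: datetime
    sent_at: Optional[datetime] = None  # unset until a provider accepts it


@dataclass
class UserDevice:
    device_id: str
    user_id: str
    platform: str  # "ios" or "android"
    push_token: str
    is_active: bool = True  # assumed default for a freshly registered device


@dataclass
class UserPreference:
    user_id: str
    channel: Channel
    notif_type: str
    is_enabled: bool = True  # assumed default: enabled until the user opts out


@dataclass
class DeliveryLog:
    notif_id: str
    attempt: int
    status: str
    timestamp: datetime
    error: Optional[str] = None  # the optional "error?" field, set only on failure
```

Keeping `DeliveryLog` separate from `Notification` lets each retry attempt be recorded as its own row while the notification keeps a single current status.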
API design
- POST /api/v1/notifications/send - Internal API. Body: { userId, type, payload, channels[] }. Enqueues to Kafka.
- PUT /api/v1/users/me/preferences - Update notification preferences.
- POST /api/v1/devices/register - Register push token. Body: { platform, pushToken }.
- GET /api/v1/notifications?cursor= - In-app notification history.
High-level design
Notification trigger → Kafka topic per channel (push/email/sms) → channel workers read queue → check user preferences → call third-party API (APNs/FCM/SendGrid/Twilio) → log delivery result → retry on failure.
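The flow above can be sketched as two small functions: one that splits a send request into per-channel messages, and one that a channel worker runs per message. The topic names (`notifications.push` and so on), the `SENT` and `RETRY` status strings, and the presence of a `notifId` on the request are assumptions for illustration; the preference lookup and provider client are passed in as plain callables rather than real Kafka or provider SDK calls.

```python
from typing import Callable, List, Tuple

# (user_id, channel, notif_type) -> enabled? Stand-in for the UserPreference lookup.
PreferenceLookup = Callable[[str, str, str], bool]
# Stand-in for a provider client (APNs/FCM/SendGrid/Twilio); returns True on success.
ProviderSend = Callable[[dict], bool]


def split_by_channel(request: dict) -> List[Tuple[str, dict]]:
    """Turn one send request into (topic, message) pairs, one per requested channel."""
    base = {k: v for k, v in request.items() if k != "channels"}
    return [
        (f"notifications.{channel}", {**base, "channel": channel})
        for channel in request["channels"]
    ]


def handle_message(
    message: dict,
    is_enabled: PreferenceLookup,
    send: ProviderSend,
    delivery_log: List[Tuple[str, str]],
) -> str:
    """Process one queued notification and return the status that was logged."""
    if not is_enabled(message["userId"], message["channel"], message["type"]):
        status = "OPTED_OUT"  # dropped without calling the provider
    elif send(message):
        status = "SENT"
    else:
        status = "RETRY"  # caller is expected to re-enqueue with backoff
    delivery_log.append((message["notifId"], status))
    return status
```

Checking preferences inside the worker, right before the provider call, means a preference change made while the message sat in the queue is still honored.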
Deep dives
🔄 Retry with Backoff
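Third-party APIs fail transiently. Strategy: Kafka consumer → call API → on 429/5xx: publish the message to a separate retry topic with its attempt count and a not-before time (the Kafka message timestamp plus the backoff delay), then commit the offset so the partition is not blocked. Kafka has no per-message ack and no built-in delayed redelivery, so the retry consumer waits until the not-before time before calling the API again. Backoff doubles per attempt: 1s, 2s, 4s, 8s, 16s. After 5 retries: send to Dead Letter Queue (DLQ) for manual inspection. APNs: if token invalid → mark device as inactive, stop retrying.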
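The retry decision itself is a small pure function. The sketch below assumes the worker tracks how many retries a message has already had; `next_step`, `backoff_delay`, and the action names are hypothetical, and the actual re-enqueueing to a retry topic or DLQ is left to the caller.

```python
from typing import Optional, Tuple

MAX_RETRIES = 5
BASE_DELAY_SECONDS = 1.0


def backoff_delay(attempt: int) -> float:
    """Delay before retry number `attempt` (1-based): doubles each time."""
    return BASE_DELAY_SECONDS * 2 ** (attempt - 1)


def is_retryable(status_code: int) -> bool:
    """Rate limiting (429) and server errors (5xx) are treated as transient."""
    return status_code == 429 or 500 <= status_code < 600


def next_step(status_code: int, retries_so_far: int) -> Tuple[str, Optional[float]]:
    """Decide what to do after a provider call.

    Returns (action, delay_seconds), where action is one of
    "done", "drop", "retry", or "dlq".
    """
    if 200 <= status_code < 300:
        return ("done", None)
    if not is_retryable(status_code):
        return ("drop", None)  # e.g. invalid token: retrying cannot help
    if retries_so_far >= MAX_RETRIES:
        return ("dlq", None)  # retries exhausted, park for manual inspection
    return ("retry", backoff_delay(retries_so_far + 1))
```

For example, `next_step(503, 0)` returns `("retry", 1.0)` and `next_step(503, 5)` returns `("dlq", None)`. Adding random jitter to the delay is a common refinement so that many failed messages do not retry at the same instant.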
🚫 Preference Enforcement
Before sending: check UserPreference table. Cache preferences in Redis (TTL 1h) per userId, and invalidate the entry whenever the user updates preferences so an opt-out does not wait for the TTL to expire. If opted out: drop message (update delivery log with OPTED_OUT status). GDPR right to erasure: delete all user data including preferences and device tokens. Global unsubscribe: one click → disable all channels for that user.
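A read-through cache with a TTL and explicit invalidation could look like the sketch below. The in-memory dict stands in for Redis, `PreferenceCache` and its methods are hypothetical names, and treating a missing preference row as "enabled" is an assumption that a real system would need to decide deliberately.

```python
import time
from typing import Callable, Dict, Optional, Tuple

PrefKey = Tuple[str, str]  # (channel, notifType)
Prefs = Dict[PrefKey, bool]


class PreferenceCache:
    """Read-through cache keyed by userId; an in-memory stand-in for Redis."""

    def __init__(self, load_from_db: Callable[[str], Prefs], ttl_seconds: float = 3600.0) -> None:
        self._load = load_from_db
        self._ttl = ttl_seconds
        self._entries: Dict[str, Tuple[float, Prefs]] = {}  # userId -> (expiry, prefs)

    def is_enabled(self, user_id: str, channel: str, notif_type: str,
                   now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        entry = self._entries.get(user_id)
        if entry is None or entry[0] <= now:
            prefs = self._load(user_id)  # cache miss or expired: hit the database
            self._entries[user_id] = (now + self._ttl, prefs)
        else:
            prefs = entry[1]
        # Assumption: no stored row means the user has not opted out.
        return prefs.get((channel, notif_type), True)

    def invalidate(self, user_id: str) -> None:
        """Call on preference update or erasure so changes apply immediately."""
        self._entries.pop(user_id, None)
```

Without the `invalidate` call on the write path, an opt-out could be ignored for up to the full TTL, which conflicts with the requirement that no notification reaches a user who opted out.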
📱 Push Token Management
Push tokens expire or change when the user reinstalls the app. On app open: always register the current token. APNs returns 410 (reason Unregistered) for tokens that are no longer valid → mark device inactive. FCM (HTTP v1 API) returns the UNREGISTERED error code for such tokens → same. Batch-check stale tokens periodically. Each user may have multiple devices, so send to all active devices.
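The token lifecycle described above can be sketched independently of any provider SDK. Here the provider response is reduced to a hypothetical `SendResult` enum, and `DeviceRegistry` is an in-memory stand-in for the UserDevice table; the `device_id` argument is assumed to come from the client or the authenticated session, since the register API body only lists platform and pushToken.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class SendResult(Enum):
    OK = "ok"
    TOKEN_INVALID = "token_invalid"  # provider says the token is no longer valid
    TRANSIENT_ERROR = "transient"    # worth retrying later


@dataclass
class Device:
    device_id: str
    user_id: str
    platform: str
    push_token: str
    is_active: bool = True


class DeviceRegistry:
    def __init__(self) -> None:
        self._devices: Dict[str, Device] = {}

    def register(self, device_id: str, user_id: str, platform: str, push_token: str) -> Device:
        """Upsert on app open: replaces a rotated token and reactivates the device."""
        device = Device(device_id, user_id, platform, push_token, is_active=True)
        self._devices[device_id] = device
        return device

    def active_devices(self, user_id: str) -> List[Device]:
        return [d for d in self._devices.values() if d.user_id == user_id and d.is_active]

    def send_to_user(self, user_id: str, send: Callable[[Device], SendResult]) -> List[str]:
        """Send to every active device; returns device IDs that need a retry."""
        to_retry = []
        for device in self.active_devices(user_id):
            result = send(device)
            if result is SendResult.TOKEN_INVALID:
                device.is_active = False  # stop sending until the app re-registers
            elif result is SendResult.TRANSIENT_ERROR:
                to_retry.append(device.device_id)
        return to_retry
```

Marking a device inactive rather than deleting it keeps the record available for analytics, and the next app open reactivates it through `register`.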
⚡ Fan-Out at Scale
Celebrity posts → trigger 10M notifications. Naive: the triggering service enqueues one Kafka message per user = 10M messages queued instantly. Better: publish one "fan-out" event; a fan-out worker reads the follower list and enqueues individual notifications in batches of 1000, so the expansion is paced and can be spread across workers. Rate-limit per channel (for example, a 300K/sec burst budget for APNs). Use multiple Kafka partitions per channel (push: 100 partitions for 100 parallel workers).
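Batching and rate limiting can be sketched as follows. `fan_out`, `batched`, and `TokenBucket` are hypothetical names; the `enqueue` callable stands in for a Kafka producer call, and the token bucket takes the current time as an argument so its behavior is easy to reason about.

```python
from typing import Callable, Iterable, Iterator, List


def batched(user_ids: Iterable[str], batch_size: int = 1000) -> Iterator[List[str]]:
    """Yield follower IDs in fixed-size chunks without loading the full list."""
    batch: List[str] = []
    for uid in user_ids:
        batch.append(uid)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch


def fan_out(event: dict, followers: Iterable[str],
            enqueue: Callable[[List[dict]], None], batch_size: int = 1000) -> int:
    """Expand one fan-out event into batched enqueue calls; returns the batch count."""
    count = 0
    for batch in batched(followers, batch_size):
        enqueue([
            {"userId": uid, "type": event["type"], "payload": event["payload"]}
            for uid in batch
        ])
        count += 1
    return count


class TokenBucket:
    """Allows bursts up to `capacity` and refills at `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, capacity: float) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.updated_at = 0.0

    def try_acquire(self, n: int, now: float) -> bool:
        elapsed = now - self.updated_at
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if n <= self.tokens:
            self.tokens -= n
            return True
        return False  # caller should wait and try again
```

A push worker would call `try_acquire(len(batch), now)` before handing a batch to the provider and back off briefly when it returns False, which keeps the send rate under the per-channel budget even when the queue is full.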
Scaling considerations
- Separate Kafka topics per channel: push/email/sms — independent scaling
- Worker fleet auto-scales with Kafka consumer lag metrics
- Redis for preference cache reduces DB load on every notification
- DLQ monitoring with alerting for high failure rates per provider
- Cost optimization: batch sends in email workers (SES is priced per 1,000 emails)
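Scaling on consumer lag comes down to a sizing formula. The sketch below is one plausible version, not a prescribed policy: `desired_workers`, the per-worker throughput, and the target drain time are all assumed inputs. The cap at the partition count reflects that a Kafka consumer group cannot use more consumers than the topic has partitions.

```python
import math


def desired_workers(consumer_lag: int, per_worker_rate: float,
                    target_drain_seconds: float, partitions: int,
                    min_workers: int = 1) -> int:
    """Workers needed to drain the current lag within the target time."""
    needed = math.ceil(consumer_lag / (per_worker_rate * target_drain_seconds))
    # Extra consumers beyond the partition count would sit idle.
    return max(min_workers, min(needed, partitions))
```

With workers that each handle 500 messages/sec and a 60-second drain target, a lag of 600,000 messages gives 20 workers, while a 10M-message fan-out would ask for 334 and be capped at the 100 push partitions.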
What interviewers expect by level
- Junior: Describe the notification flow through Kafka to third-party APIs. Know push vs email vs SMS differences.
- Mid: Retry with exponential backoff, preference enforcement, DLQ, push token lifecycle.
- Senior: Fan-out architecture for celebrity events, multi-provider failover, batch optimization, rate limiting per provider.
- Staff: Global notification routing (region-local workers and provider endpoints, e.g. US/EU), cost per notification optimization, GDPR compliance pipeline.