🔔 Design Notification System — System Design Interview Guide

Medium · Messaging & Fan-Out

Design a notification system that delivers messages via push (iOS/Android), email, and SMS to 100M+ users with reliable delivery, deduplication, and user preference management.


Functional requirements

Non-functional requirements & scale

Capacity estimation

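Baseline: 10M notifications/day across push, email, and SMS, or roughly 116/sec on average (10M / 86,400 s). Average load is not the sizing constraint, though: a single fan-out event (a new post by a celebrity) can trigger 10M push notifications at once, as much as the whole daily baseline in one burst. Each channel also has different throughput characteristics: push is fast and cheap, email is medium-speed and depends on deliverability, and SMS is slow and expensive.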
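The gap between average and burst load can be made concrete with a few lines of arithmetic. This is a rough sketch: the 10M figures come from the estimate above, while the 60-second drain target is an assumption chosen only for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

daily_notifications = 10_000_000  # baseline volume from the estimate above
celebrity_fanout = 10_000_000     # one fan-out event, from the estimate above

# Average rate if the daily volume were spread evenly over the day.
avg_per_sec = daily_notifications / SECONDS_PER_DAY


def burst_rate(total: int, drain_seconds: int) -> float:
    """Sustained send rate needed to drain `total` notifications in `drain_seconds`."""
    return total / drain_seconds


# ASSUMPTION: we want a celebrity fan-out fully delivered within 60 seconds.
drain_target_s = 60
burst_per_sec = burst_rate(celebrity_fanout, drain_target_s)

# How many times larger the burst rate is than the daily average.
burst_multiplier = burst_per_sec / avg_per_sec

print(f"average: {avg_per_sec:.0f}/sec")
print(f"burst:   {burst_per_sec:.0f}/sec ({burst_multiplier:.0f}x average)")
```

Under that assumed target, the push path has to be sized for a rate three orders of magnitude above the daily average, which is why the design leans on queues and partitioned workers rather than synchronous sends.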

Core entities

API design

High-level design

Notification trigger → per-channel Kafka topic (push/email/SMS) → channel workers consume from their topic → check user preferences → call the third-party API (APNs/FCM for push, SendGrid for email, Twilio for SMS) → log the delivery result → retry on failure.
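The flow above can be sketched as a single worker step. This is a minimal in-memory illustration, not a Kafka client: the `Notification` fields, the `notifications.<channel>` topic naming, and the `SENT`/`RETRY` status names are assumptions made for the example (only `OPTED_OUT` appears in this guide), and the provider call is passed in as a plain function.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

CHANNELS = ("push", "email", "sms")


@dataclass(frozen=True)
class Notification:
    notification_id: str
    user_id: str
    channel: str  # one of CHANNELS
    body: str


def topic_for(channel: str) -> str:
    """Map a channel to its topic. The naming scheme is hypothetical."""
    if channel not in CHANNELS:
        raise ValueError(f"unknown channel: {channel}")
    return f"notifications.{channel}"


def handle(
    n: Notification,
    is_opted_in: Callable[[str, str], bool],
    send: Callable[[Notification], None],
    delivery_log: List[Tuple[str, str]],
) -> str:
    """One worker step: check preferences, call the provider, log the result."""
    if not is_opted_in(n.user_id, n.channel):
        status = "OPTED_OUT"
    else:
        try:
            send(n)  # stands in for the APNs/FCM/SendGrid/Twilio call
            status = "SENT"
        except Exception:
            status = "RETRY"  # handed to the retry path
    delivery_log.append((n.notification_id, status))
    return status
```

Passing `is_opted_in` and `send` as parameters keeps the worker logic identical across channels; only the provider adapter differs.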

Deep dives

🔄 Retry with Backoff

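Third-party APIs fail transiently, so workers retry on 429/5xx with exponential backoff (1s, 2s, 4s, 8s, ...). Kafka has no per-message ack and no broker-side delayed redelivery, so the backoff is implemented with a separate retry topic: on a retryable failure the worker republishes the message to the retry topic with an incremented attempt count and commits the original offset, and the retry consumer waits until the message timestamp plus the backoff delay has passed before calling the API again. After 5 retries: send to a Dead Letter Queue (DLQ) for manual inspection. Not every failure is retryable: if APNs reports the token as invalid → mark the device as inactive and stop retrying.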
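The retry decision can be sketched as a pure function of the provider's HTTP status and the number of retries already made. This is an illustration under assumptions: the action names are made up for the example, the fifth delay (16s) simply extends the 1s, 2s, 4s, 8s sequence, and routing other 4xx responses straight to the DLQ is a choice this guide does not specify.

```python
from typing import Optional, Tuple

MAX_RETRIES = 5      # from the text: DLQ after 5 retries
BASE_DELAY_S = 1.0   # first backoff step


def backoff_delay(attempt: int) -> float:
    """Delay before retry number `attempt` (1-based): 1s, 2s, 4s, 8s, 16s."""
    return BASE_DELAY_S * 2 ** (attempt - 1)


def next_step(status_code: int, retries_so_far: int) -> Tuple[str, Optional[float]]:
    """Return (action, delay_seconds) for a provider response.

    Actions (hypothetical names): "done", "retry", "dlq", "deactivate_token".
    """
    if 200 <= status_code < 300:
        return ("done", None)
    if status_code == 410:
        # Invalid/expired push token: retrying cannot succeed.
        return ("deactivate_token", None)
    retryable = status_code == 429 or 500 <= status_code < 600
    if not retryable:
        # ASSUMPTION: other client errors are not retried.
        return ("dlq", None)
    if retries_so_far >= MAX_RETRIES:
        return ("dlq", None)
    return ("retry", backoff_delay(retries_so_far + 1))
```

For example, `next_step(503, 0)` yields `("retry", 1.0)` and `next_step(503, 5)` yields `("dlq", None)`. In a retry-topic setup, the returned delay would be compared against the message timestamp by the retry consumer.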

🚫 Preference Enforcement

Before sending: check UserPreference table. Cache preferences in Redis (TTL 1h) per userId. If opted out: drop message (update delivery log with OPTED_OUT status). GDPR: right to erasure — delete all user data including preferences and device tokens. Global unsubscribe: one click → disable all channels for that user.
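A cache-aside lookup with a one-hour TTL might look like the sketch below. An in-process dict stands in for Redis and a callable stands in for the UserPreference table; the class and method names are hypothetical, and the explicit `invalidate` hook (for preference changes, global unsubscribe, or erasure) is an assumption beyond what the text states.

```python
import time
from typing import Callable, Dict, FrozenSet, Iterable, Tuple

PREF_TTL_S = 3600  # 1h, as in the text


class PreferenceCache:
    """Cache-aside lookup of a user's enabled channels with a TTL."""

    def __init__(
        self,
        load_from_db: Callable[[str], Iterable[str]],
        ttl_s: float = PREF_TTL_S,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._load = load_from_db
        self._ttl = ttl_s
        self._clock = clock
        # user_id -> (expiry time, enabled channels)
        self._entries: Dict[str, Tuple[float, FrozenSet[str]]] = {}

    def enabled_channels(self, user_id: str) -> FrozenSet[str]:
        now = self._clock()
        hit = self._entries.get(user_id)
        if hit is not None and hit[0] > now:
            return hit[1]  # fresh cache hit
        channels = frozenset(self._load(user_id))  # miss or expired: reload
        self._entries[user_id] = (now + self._ttl, channels)
        return channels

    def invalidate(self, user_id: str) -> None:
        """Drop the cached entry, e.g. after an opt-out or an erasure request."""
        self._entries.pop(user_id, None)


def decide(cache: PreferenceCache, user_id: str, channel: str) -> str:
    """Return "SEND" or the OPTED_OUT status recorded in the delivery log."""
    return "SEND" if channel in cache.enabled_channels(user_id) else "OPTED_OUT"
```

Without some form of invalidation, an opt-out could take up to the full TTL to be honoured, which is worth raising as a trade-off in an interview.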

📱 Push Token Management

Push tokens expire or change when the user reinstalls the app. On app open: always register the current token. APNs returns 410 (Gone) for tokens that are no longer valid → mark device inactive. FCM's HTTP v1 API reports stale tokens with error code UNREGISTERED (HTTP 404); the legacy API used the error strings "NotRegistered" and "InvalidRegistration" → same. Batch-check stale tokens periodically. Each user may have multiple devices, so send to all active devices.
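The token lifecycle can be sketched with a small in-memory registry. The `device_id` key, the class and function names, and the convention that the injected `push` function returns an HTTP-like status (200 for success, 410 for an invalid token) are all assumptions for illustration; a real worker would map each provider's own error response onto the same "invalid token" outcome.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Device:
    token: str
    platform: str  # e.g. "ios" or "android"
    active: bool = True


class DeviceRegistry:
    """Tracks every device per user and whether its token is still usable."""

    def __init__(self) -> None:
        self._by_user: Dict[str, Dict[str, Device]] = {}

    def register(self, user_id: str, device_id: str, token: str, platform: str) -> None:
        """Called on every app open: upsert the current token and reactivate."""
        self._by_user.setdefault(user_id, {})[device_id] = Device(token, platform)

    def active_tokens(self, user_id: str) -> List[str]:
        return [d.token for d in self._by_user.get(user_id, {}).values() if d.active]

    def mark_inactive(self, user_id: str, token: str) -> None:
        for device in self._by_user.get(user_id, {}).values():
            if device.token == token:
                device.active = False


def send_to_all(registry: DeviceRegistry, user_id: str, push: Callable[[str], int]) -> int:
    """Send to every active device; deactivate tokens the provider rejects."""
    delivered = 0
    for token in registry.active_tokens(user_id):
        status = push(token)
        if status == 410:
            registry.mark_inactive(user_id, token)
        elif status == 200:
            delivered += 1
    return delivered
```

Because `register` overwrites the entry for a device, a reinstall that issues a new token replaces the stale one the next time the app opens.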

⚡ Fan-Out at Scale

Celebrity posts → trigger 10M notifications. Naive: the triggering service itself produces one Kafka message per user = 10M messages queued instantly, blocking the request path. Better: publish one "fan-out" event; a fan-out worker reads the follower list and enqueues individual notifications in batches of 1000, so the trigger returns immediately and the enqueue rate stays under the worker's control. Rate-limit per channel (APNs: 300K/sec burst). Use multiple Kafka partitions per channel (push: 100 partitions for 100 parallel workers).
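The batching step of the fan-out worker can be sketched as follows. This is a simplified illustration: `enqueue_batch` stands in for a batched producer call, the `event_id:user_id` key is an assumed dedup/idempotency key, and CRC32 is used only as a stable stand-in hash (Kafka's own default partitioner uses a different hash function).

```python
import zlib
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Tuple

BATCH_SIZE = 1000       # from the text
PUSH_PARTITIONS = 100   # from the text

# (partition, dedup key, user id)
Message = Tuple[int, str, str]


def batched(ids: Iterable[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` ids without loading them all."""
    it = iter(ids)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def partition_for(user_id: str, partitions: int = PUSH_PARTITIONS) -> int:
    """Stable user -> partition mapping (CRC32 is a stand-in hash)."""
    return zlib.crc32(user_id.encode("utf-8")) % partitions


def fan_out(
    event_id: str,
    follower_ids: Iterable[str],
    enqueue_batch: Callable[[List[Message]], None],
) -> int:
    """Expand one fan-out event into per-user messages, one batch at a time."""
    total = 0
    for batch in batched(follower_ids):
        enqueue_batch([(partition_for(uid), f"{event_id}:{uid}", uid) for uid in batch])
        total += len(batch)
    return total
```

Streaming the follower list through `batched` keeps memory flat even for 10M followers, and a per-channel rate limiter could be applied between batches.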

Scaling considerations

What interviewers expect by level

