💬 Design Chat System — System Design Interview Guide
Medium · Messaging & Real-Time
Design a real-time messaging system like WhatsApp or Slack that supports 1-to-1 messaging, group chats, message delivery guarantees, and online presence.
Functional requirements
- Send and receive 1-to-1 messages in real-time
- Group chats with up to 500 members
- Message delivery receipts: sent ✓, delivered ✓✓, read ✓✓ (blue)
- Online/offline presence indicator
- Message history: load past messages with pagination
- Support for media messages (images, files)
Non-functional requirements & scale
- 2B registered users; 500M DAU (WhatsApp scale)
- Message delivery latency < 100ms (P99)
- Messages must be durable — no data loss
- Presence updates: eventual consistency acceptable
- End-to-end encryption for all messages
- Support for offline message delivery (push notifications)
Capacity estimation
500M DAU × ~40 messages/day each = 20B messages/day ≈ 231K messages/sec on average (peak traffic will be higher). At ~1KB per message, storage grows by 20B × 1KB = 20TB/day. Real-time delivery needs persistent connections (WebSocket): with plain request/response HTTP the server cannot push a message to a client unless the client polls for it.
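The arithmetic above can be checked with a few lines of Python. This is only a back-of-envelope sketch: it uses decimal units (1KB = 1,000 bytes), and the yearly figure is an extrapolation added here for illustration, not a number from the guide.

```python
# Back-of-envelope capacity numbers from the estimate above.
DAU = 500_000_000               # daily active users
MESSAGES_PER_USER_PER_DAY = 40
MESSAGE_SIZE_BYTES = 1_000      # ~1KB per message (decimal units, an assumption)
SECONDS_PER_DAY = 86_400

messages_per_day = DAU * MESSAGES_PER_USER_PER_DAY
avg_messages_per_sec = messages_per_day / SECONDS_PER_DAY
storage_tb_per_day = messages_per_day * MESSAGE_SIZE_BYTES / 1e12
storage_pb_per_year = storage_tb_per_day * 365 / 1_000  # extrapolation, not from the guide

print(f"{messages_per_day:,} messages/day")
print(f"{avg_messages_per_sec:,.0f} messages/sec on average")
print(f"{storage_tb_per_day:.0f} TB/day, {storage_pb_per_year:.1f} PB/year")
```

These are averages; a real sizing exercise would multiply by a peak factor and by the replication factor of the message store.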
Core entities
- User — userId, phoneNumber, displayName, publicKey (E2E), lastSeen
- Message — messageId (Snowflake), chatId, senderId, content (encrypted), type, status, createdAt
- Chat — chatId, type (1-to-1|group), memberIds[], createdAt, lastMessageId
- Presence — userId, isOnline, lastSeen, serverId (which WS server)
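The Message entity uses a Snowflake-style messageId so that IDs sort by creation time. A minimal sketch of such a generator follows, assuming the commonly cited layout of a millisecond timestamp, 10 worker bits, and 12 sequence bits; the epoch value and the class name are hypothetical, and a production generator would need a stricter policy for clock skew.

```python
import threading
import time


class SnowflakeGenerator:
    """Sketch of a Snowflake-style ID: timestamp | worker | sequence."""

    EPOCH_MS = 1_600_000_000_000  # hypothetical custom epoch (ms since Unix epoch)
    WORKER_BITS = 10
    SEQUENCE_BITS = 12
    MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1

    def __init__(self, worker_id: int):
        if not 0 <= worker_id < (1 << self.WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    @staticmethod
    def _now_ms() -> int:
        return time.time_ns() // 1_000_000

    def next_id(self) -> int:
        with self._lock:
            # Never step backwards, even if the wall clock does.
            now = max(self._now_ms(), self._last_ms)
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & self.MAX_SEQUENCE
                if self._sequence == 0:
                    # 4096 IDs already issued this millisecond: wait for the next one.
                    while now <= self._last_ms:
                        now = self._now_ms()
            else:
                self._sequence = 0
            self._last_ms = now
            return (
                ((now - self.EPOCH_MS) << (self.WORKER_BITS + self.SEQUENCE_BITS))
                | (self.worker_id << self.SEQUENCE_BITS)
                | self._sequence
            )


def timestamp_ms(snowflake_id: int) -> int:
    """Recover the creation time (ms since Unix epoch) from an ID."""
    shift = SnowflakeGenerator.WORKER_BITS + SnowflakeGenerator.SEQUENCE_BITS
    return (snowflake_id >> shift) + SnowflakeGenerator.EPOCH_MS
```

IDs from one worker are strictly increasing; IDs from different workers are only roughly time-ordered, to within clock skew between machines.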
API design
- WS wss://chat.app/ws — Persistent WebSocket connection. Client subscribes to their userId channel.
- POST /api/v1/messages — Send message. Body: { chatId, content, type }. Fallback for when the WebSocket is unavailable.
- GET /api/v1/chats/:chatId/messages?cursor= — Paginated message history.
- POST /api/v1/chats — Create group chat. Body: { name, memberIds[] }.
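The guide does not define what the `cursor` on the history endpoint contains. One common choice, assumed here, is the messageId of the oldest message already returned, so the next page is "everything older than the cursor". The function name `fetch_page` and the in-memory list standing in for the message store are hypothetical.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple


def fetch_page(
    message_ids: List[int], cursor: Optional[int] = None, limit: int = 50
) -> Tuple[List[int], Optional[int]]:
    """Return (page, next_cursor) for a chat's history.

    message_ids must be sorted ascending (oldest first). The page is
    newest-first and contains only IDs strictly older than the cursor.
    next_cursor is None once the oldest message has been returned.
    """
    end = len(message_ids) if cursor is None else bisect_left(message_ids, cursor)
    start = max(0, end - limit)
    page = message_ids[start:end][::-1]
    next_cursor = page[-1] if start > 0 else None
    return page, next_cursor


history = list(range(1, 11))           # ten messages with IDs 1..10
page, cursor = fetch_page(history, None, limit=4)
print(page, cursor)
```

Because the cursor is a position in the ID order rather than an offset, new messages arriving while the user scrolls back do not shift or duplicate pages.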
High-level design
Client connects via WebSocket to a Chat Server. A sent message is stored in Cassandra, then routed to the recipient's WebSocket server via Redis Pub/Sub and delivered over the WebSocket, or via push notification if the recipient is offline.
Deep dives
🔌 WebSocket at Scale
Each Chat Server holds N persistent WS connections. Problem: if User A is connected to Server 1 and User B to Server 2, how does A's message reach B? Solution: Redis Pub/Sub. When User B connects, Server 2 subscribes to channel "user:B"; Server 1 publishes A's message to "user:B", and Server 2 pushes it down User B's connection. The presence service tracks which server each user is on.
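The routing pattern can be sketched without a running Redis instance. In the sketch below, `InMemoryPubSub` is a hypothetical stand-in for Redis Pub/Sub, and each user's "WebSocket connection" is just a Python list; only the "user:<id>" channel naming comes from the text above.

```python
from collections import defaultdict


class InMemoryPubSub:
    """Hypothetical stand-in for Redis Pub/Sub (not the real client API)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, channel, callback):
        self._subscribers[channel].append(callback)

    def unsubscribe(self, channel, callback):
        self._subscribers[channel].remove(callback)

    def publish(self, channel, payload):
        """Deliver to current subscribers; return how many received it."""
        callbacks = list(self._subscribers.get(channel, []))
        for callback in callbacks:
            callback(payload)
        return len(callbacks)


class ChatServer:
    def __init__(self, name, broker):
        self.name = name
        self.broker = broker
        self.inboxes = {}     # user_id -> list standing in for a WS connection
        self._handlers = {}   # user_id -> callback registered with the broker

    def connect(self, user_id):
        self.inboxes[user_id] = []
        handler = self.inboxes[user_id].append
        self._handlers[user_id] = handler
        self.broker.subscribe(f"user:{user_id}", handler)

    def disconnect(self, user_id):
        self.broker.unsubscribe(f"user:{user_id}", self._handlers.pop(user_id))
        del self.inboxes[user_id]

    def send(self, to_user, payload):
        """Publish to the recipient's channel; False means nobody was listening."""
        return self.broker.publish(f"user:{to_user}", payload) > 0


broker = InMemoryPubSub()
server1, server2 = ChatServer("server-1", broker), ChatServer("server-2", broker)
server1.connect("A")
server2.connect("B")
server1.send("B", "hello from A")   # arrives in server2.inboxes["B"]
```

A publish that reaches zero subscribers is the signal that the recipient has no live connection, which is where the offline path takes over.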
✉️ Message Delivery Guarantee
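At-least-once delivery: the server acks the sender only after the message is written to Cassandra, and unacked messages are retried, so duplicates are possible. Clients deduplicate by message ID. Delivery receipts: recipient sends an ack back over WS → server updates message status in DB → notifies sender. Read receipts: client sends a read event for a chatId covering everything up to a given messageId.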
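A small sketch of the two mechanisms described above: client-side deduplication by message ID, and a receipt status that only moves forward (sent → delivered → read). The class and method names are hypothetical, and plain dictionaries stand in for the client's local state and the message table.

```python
from enum import IntEnum


class Status(IntEnum):
    SENT = 1
    DELIVERED = 2
    READ = 3


class Recipient:
    """Client side: apply each message once, but ack every copy."""

    def __init__(self):
        self.seen_ids = set()
        self.inbox = []

    def receive(self, message_id, content):
        if message_id not in self.seen_ids:   # a retry of a seen ID is dropped
            self.seen_ids.add(message_id)
            self.inbox.append(content)
        return message_id                      # the ack sent back over WS


class MessageStore:
    """Server side: status per message, never moving backwards."""

    def __init__(self):
        self.status = {}

    def persist(self, message_id):
        self.status.setdefault(message_id, Status.SENT)

    def advance(self, message_id, new_status):
        # max() keeps READ from being downgraded by a late DELIVERED ack.
        self.status[message_id] = max(self.status[message_id], new_status)
        return self.status[message_id]

    def mark_read_up_to(self, chat_message_ids, up_to_id):
        """Read event: everything in the chat with ID <= up_to_id is read."""
        for message_id in chat_message_ids:
            if message_id <= up_to_id:
                self.advance(message_id, Status.READ)
```

Acking duplicates matters: if the recipient stayed silent on a retry, the server would keep resending a message the client already has.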
👥 Group Chat Scaling
For small groups (<100 members): fan out to each member's WS server via Redis Pub/Sub, one publish per member. For large groups (100-500 members): store group membership in the DB; on each message, publish once to a group topic; every server with a connected member subscribes to that topic and delivers to its local members. Cassandra stores messages with chatId as the partition key for efficient range reads.
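The two fan-out strategies differ mainly in how many publishes one message costs. The sketch below counts them, using a hypothetical in-memory `Broker` in place of Redis Pub/Sub; the 100-member threshold comes from the text, while the "group:<chatId>" channel name and the payload shape are assumptions.

```python
from collections import defaultdict

SMALL_GROUP_LIMIT = 100  # threshold from the text above


class Broker:
    """Hypothetical in-memory pub/sub that counts publish calls."""

    def __init__(self):
        self.subscribers = defaultdict(list)
        self.publish_count = 0

    def subscribe(self, channel, callback):
        self.subscribers[channel].append(callback)

    def publish(self, channel, payload):
        self.publish_count += 1
        for callback in list(self.subscribers.get(channel, [])):
            callback(payload)


def send_to_group(broker, chat_id, member_ids, payload):
    """payload is {"sender": user_id, "text": str} in this sketch."""
    if len(member_ids) < SMALL_GROUP_LIMIT:
        # Small group: one publish per recipient's personal channel.
        for member in member_ids:
            if member != payload["sender"]:
                broker.publish(f"user:{member}", payload)
    else:
        # Large group: a single publish; servers fan out locally.
        broker.publish(f"group:{chat_id}", payload)


def make_server_handler(local_members, inboxes):
    """What one WS server does with a group-topic message."""
    def handler(payload):
        for member in local_members:
            if member != payload["sender"]:
                inboxes[member].append(payload["text"])
    return handler
```

With 150 members spread over three servers, the large-group path costs one publish and three local fan-outs instead of 149 publishes.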
📱 Offline Delivery
If the recipient is offline (no active WS connection): send a push notification via APNs (iOS) or FCM (Android). The message is stored in Cassandra regardless. On reconnect, the client sends its lastSeenMessageId and the server returns all messages stored since then. This handles network interruptions transparently.
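The store-first, then deliver-or-notify flow and the reconnect sync can be sketched as follows. `DeliveryService` and its fields are hypothetical; a dictionary stands in for Cassandra, and lists record what would have gone out over WS or through APNs/FCM. The sketch assumes message IDs arrive in ascending order per recipient.

```python
from bisect import bisect_right


class DeliveryService:
    def __init__(self):
        self.stored = {}               # recipient_id -> [(message_id, content)], ascending
        self.online = set()            # recipients with a live WS connection
        self.ws_pushed = []            # (recipient_id, message_id) sent over WS
        self.push_notifications = []   # (recipient_id, message_id) sent via APNs/FCM

    def deliver(self, recipient_id, message_id, content):
        # Persist first, whether or not the recipient is reachable.
        self.stored.setdefault(recipient_id, []).append((message_id, content))
        if recipient_id in self.online:
            self.ws_pushed.append((recipient_id, message_id))
            return "websocket"
        self.push_notifications.append((recipient_id, message_id))
        return "push"

    def sync(self, recipient_id, last_seen_message_id):
        """On reconnect: return every stored message newer than the client's last one."""
        self.online.add(recipient_id)
        messages = self.stored.get(recipient_id, [])
        ids = [message_id for message_id, _ in messages]
        return messages[bisect_right(ids, last_seen_message_id):]


service = DeliveryService()
service.deliver("bob", 1, "are you there?")   # bob is offline: push notification
print(service.sync("bob", 0))                  # bob reconnects and catches up
```

Since the sync is keyed on the last ID the client actually has, the same call covers both a long offline period and a brief network drop.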
Scaling considerations
- Cassandra partitioned by chatId; clustering key = messageId (Snowflake) for time ordering
- At higher throughput, replace Redis Pub/Sub with Kafka for group messaging
- Consistent hashing for WebSocket server assignment to minimize reconnects
- Presence: client heartbeat every 30s refreshes a Redis key with TTL = 35s, so the key auto-expires shortly after a disconnect
- Media: upload directly to S3 via pre-signed URL; store S3 key in message
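Consistent hashing for WebSocket server assignment, mentioned in the list above, can be illustrated with a small hash ring. This is a sketch under assumptions: the `HashRing` class, the server names, the choice of MD5 as the hash, and 100 virtual nodes per server are all illustrative, not taken from the guide.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring with virtual nodes."""

    def __init__(self, servers=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []   # sorted list of (hash, server)
        for server in servers:
            self.add(server)

    @staticmethod
    def _hash(key: str) -> int:
        # First 8 bytes of MD5 as an integer; used for spread, not for security.
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def add(self, server: str) -> None:
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{server}#{i}"), server))

    def remove(self, server: str) -> None:
        self._ring = [entry for entry in self._ring if entry[1] != server]

    def server_for(self, user_id: str) -> str:
        if not self._ring:
            raise LookupError("no servers in ring")
        # First ring position at or after the key's hash, wrapping around.
        index = bisect.bisect_left(self._ring, (self._hash(user_id), ""))
        return self._ring[index % len(self._ring)][1]


ring = HashRing(["ws-1", "ws-2", "ws-3"])
users = [f"user-{i}" for i in range(1000)]
before = {user: ring.server_for(user) for user in users}
ring.remove("ws-2")
moved = [user for user in users if ring.server_for(user) != before[user]]
print(f"{len(moved)} of {len(users)} users reassigned")
```

Only users who were on the removed server are reassigned; everyone else keeps their server, which is what keeps reconnects low when the fleet changes size.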
What interviewers expect by level
- Junior: Describe WebSocket for real-time, store messages in DB, handle offline with push notifications.
- Mid: Design Redis Pub/Sub routing between WS servers, Cassandra schema, delivery receipts flow.
- Senior: Full group chat fan-out, Redis cluster for pub/sub, Kafka at scale, E2E encryption key exchange.
- Staff: Multi-region with message sync, GDPR compliance (right-to-delete), cost model for 20TB/day storage.