📨 Design Message Queue — System Design Interview Guide
Easy · Fundamentals
Design a message queue system like Apache Kafka or Amazon SQS that enables asynchronous communication between services with guaranteed delivery, at-least-once semantics, and consumer group support.
Functional requirements
- Producers publish messages to named topics
- Consumers subscribe to topics and receive messages
- Multiple consumer groups read the same topic independently
- At-least-once delivery guarantee
- Message retention for a configurable period (default: 7 days)
- Dead Letter Queue for failed messages
Non-functional requirements & scale
- Throughput: 1M messages/sec per topic
- Publish latency < 5ms P99
- Consume latency < 10ms P99
- Messages must be durable — survive broker restart
- Horizontal scaling: add partitions to increase throughput
- Message ordering guaranteed within a partition
Capacity estimation
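A message queue decouples producers from consumers. Key concepts: topic (named channel), partition (unit of parallelism), offset (a message's position within a partition), consumer group (set of consumers that share the load of one topic). The broker persists messages to disk for durability, so storage is sized by ingress rate × retention period × replication factor.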
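A rough sizing pass can be sketched from the stated requirements (1M messages/sec, 7-day retention). The average message size (1 KB) and the replication factor (3) are assumptions for illustration, not figures from the requirements; adjust them to your own workload.

```python
# Back-of-the-envelope capacity estimate for a single topic.
MESSAGES_PER_SEC = 1_000_000   # from the non-functional requirements
RETENTION_DAYS = 7             # from the functional requirements
AVG_MESSAGE_BYTES = 1_000      # ASSUMPTION: ~1 KB per message
REPLICATION_FACTOR = 3         # ASSUMPTION: three copies of each partition

SECONDS_PER_DAY = 86_400


def estimate_capacity(msgs_per_sec, avg_bytes, retention_days, replication_factor):
    """Return ingress bandwidth and retained storage in bytes."""
    ingress = msgs_per_sec * avg_bytes
    retained_one_copy = ingress * SECONDS_PER_DAY * retention_days
    return {
        "ingress_bytes_per_sec": ingress,
        "retained_bytes_one_copy": retained_one_copy,
        "retained_bytes_all_replicas": retained_one_copy * replication_factor,
    }


estimate = estimate_capacity(
    MESSAGES_PER_SEC, AVG_MESSAGE_BYTES, RETENTION_DAYS, REPLICATION_FACTOR
)
for name, value in estimate.items():
    print(f"{name}: {value:,}")
```

Under these assumptions the topic ingests about 1 GB/s, retains about 605 TB per copy, and needs about 1.8 PB across three replicas.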
Core entities
- Topic — topicName, partitionCount, replicationFactor, retentionMs
- Message — key (optional), value (bytes), timestamp, headers{}, partition, offset
- ConsumerGroup — groupId, topicName, partitionOffsets{partition: offset}, members[]
API design
- Internal SDK: producer.send(topic, key, value). Publish a message. The key determines the partition (hash(key) % partitions).
- Internal SDK: consumer.poll(timeout). Fetch a batch of messages. Returns up to max.poll.records records.
- Internal SDK: consumer.commitOffset(). Commit the offset after successful processing.
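The hash(key) % partitions rule can be sketched as follows. The function name choose_partition is hypothetical, and zlib.crc32 is used here only as a stand-in for a stable hash (Python's built-in hash() is randomized per process for bytes and strings; Kafka's default partitioner uses murmur2). Keyless messages are spread round-robin in this sketch.

```python
import zlib
from itertools import count

# Counter used to spread keyless messages across partitions.
_round_robin = count()


def choose_partition(key, num_partitions):
    """Map a message key (bytes or None) to a partition index.

    Same key -> same partition, which is what preserves per-key ordering.
    """
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    if key is None:
        # No key: no ordering requirement, so just balance the load.
        return next(_round_robin) % num_partitions
    # crc32 is a stand-in for a stable hash; it returns an unsigned int.
    return zlib.crc32(key) % num_partitions


# All events for one user land in one partition.
p1 = choose_partition(b"user-42", 8)
p2 = choose_partition(b"user-42", 8)
print(p1 == p2)
```

A stable hash matters because producers on different hosts must agree on where a given key goes.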
High-level design
Producer sends message → Leader broker receives → writes to partition log on disk → replicates to followers → acks to producer. Consumer polls leader → receives messages → processes → commits offset.
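The publish, poll, and commit flow can be illustrated with a single in-memory partition. This is a teaching sketch only: PartitionLog and its methods are hypothetical names, and disk persistence and follower replication are deliberately left out.

```python
from collections import defaultdict


class PartitionLog:
    """One partition: an append-only list where the index is the offset."""

    def __init__(self):
        self.records = []
        # group_id -> next offset that group should read
        self.committed = defaultdict(int)

    def append(self, value):
        """Producer path: append and return the assigned offset."""
        self.records.append(value)
        return len(self.records) - 1

    def poll(self, group_id, max_records=100):
        """Consumer path: return (offset, value) pairs after the committed offset."""
        start = self.committed[group_id]
        end = min(start + max_records, len(self.records))
        return [(offset, self.records[offset]) for offset in range(start, end)]

    def commit(self, group_id, next_offset):
        """Record that the group has processed everything before next_offset."""
        self.committed[group_id] = next_offset


log = PartitionLog()
for event in ("created", "paid", "shipped"):
    log.append(event)

batch = log.poll("billing", max_records=2)
log.commit("billing", batch[-1][0] + 1)   # commit after processing
print(log.poll("billing"))                # only the uncommitted tail remains
print(log.poll("analytics"))              # a second group starts from offset 0
```

Because each group keeps its own committed offset, the two groups read the same log independently and messages are never deleted on consumption.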
Deep dives
📦 Partitions & Ordering
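A topic is split into multiple partitions. Message key → hash → partition. Messages within a partition are strictly ordered by offset; across partitions there is no global order. Within a consumer group, each partition is assigned to exactly one consumer, so partition count = max consumer parallelism per group. Add partitions to scale throughput: existing messages stay in their old partitions, but hash(key) % partitions changes, so new messages for an existing key may land in a different partition and per-key ordering can break across the change. Rebalance: a consumer joining or leaving the group triggers partition reassignment.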
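The rule that each partition belongs to exactly one consumer in a group can be sketched with a simple round-robin assignment. The function assign_partitions is a hypothetical helper; real brokers offer several assignment strategies, and this is only one plausible shape.

```python
def assign_partitions(partitions, consumers):
    """Round-robin partitions over the consumers of one group.

    Returns {consumer_id: [partition, ...]}. Every partition gets exactly
    one owner; consumers beyond the partition count receive nothing.
    """
    if not consumers:
        return {}
    ordered = sorted(consumers)
    assignment = {consumer: [] for consumer in ordered}
    for index, partition in enumerate(sorted(partitions)):
        owner = ordered[index % len(ordered)]
        assignment[owner].append(partition)
    return assignment


print(assign_partitions([0, 1, 2, 3], ["c1", "c2"]))
print(assign_partitions([0, 1], ["c1", "c2", "c3"]))
```

With four partitions and two consumers, each consumer owns two partitions. With two partitions and three consumers, the third consumer sits idle, which is why partition count caps parallelism within a group. Calling the function again after the member list changes is, in effect, a rebalance.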
💾 Durability & Replication
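Kafka writes messages to disk using sequential I/O, which is fast. Replication: each partition has 1 leader + N-1 followers, where N is the replication factor. ISR (In-Sync Replicas): the leader plus the followers that have caught up with it within replica.lag.time.max.ms. Producer acks: acks=0 (fire-and-forget), acks=1 (leader ack), acks=all (every ISR member acks). Durable: acks=all + min.insync.replicas=2; with replication factor 3 this tolerates one broker failure without losing acknowledged writes or blocking producers. Leader failure: the controller promotes an ISR follower to leader.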
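How acks and min.insync.replicas interact can be sketched as a small decision function. The names required_acks and NotEnoughReplicasError are hypothetical and this is not broker code; it only models the rule that acks=all waits for the current ISR and that writes are rejected when the ISR shrinks below the configured minimum.

```python
class NotEnoughReplicasError(Exception):
    """Raised when the ISR is smaller than min.insync.replicas."""


def required_acks(acks, isr_size, min_insync_replicas=2):
    """Return how many replicas must have the write before the producer is acked."""
    if acks == 0:
        return 0              # fire-and-forget: producer does not wait
    if acks == 1:
        return 1              # leader has appended the message
    if acks == "all":
        if isr_size < min_insync_replicas:
            # Refuse the write rather than accept it with too few copies.
            raise NotEnoughReplicasError(
                f"ISR size {isr_size} < min.insync.replicas {min_insync_replicas}"
            )
        return isr_size       # every in-sync replica must have it
    raise ValueError(f"unsupported acks value: {acks!r}")


print(required_acks("all", isr_size=3))   # healthy cluster
print(required_acks("all", isr_size=2))   # one broker down, writes continue
```

The trade-off is availability versus durability: with min.insync.replicas=2 and only one in-sync replica left, producers using acks=all get errors instead of a write that a single further failure could lose.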
🔄 Consumer Offset Management
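The consumer tracks its position (offset) in each partition. Commit offset after successful processing = at-least-once (a message may be processed twice if the consumer crashes after processing but before committing, so handlers should be idempotent). Commit before processing = at-most-once (a crash after the commit loses the message). Exactly-once: process + commit in the same transaction (Kafka transactions); this covers consume-process-produce flows inside Kafka, while side effects in external systems still need idempotent writes. DLQ: after N failed retries, send the message to a dead-letter topic for manual inspection.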
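The commit-after-processing pattern with a retry limit and a dead-letter path can be sketched as below. process_batch is a hypothetical helper: it returns the offset to commit and the records that should go to the DLQ, and leaves the actual commit and DLQ publish to the caller.

```python
def process_batch(records, handler, max_retries=3):
    """Process (offset, value) records with at-least-once semantics.

    Returns (next_offset_to_commit, dead_letters). The caller commits only
    after this function returns, so a crash mid-batch causes redelivery
    rather than loss.
    """
    dead_letters = []
    next_offset = None
    for offset, value in records:
        for attempt in range(1, max_retries + 1):
            try:
                handler(value)
                break
            except Exception:
                if attempt == max_retries:
                    # Give up on this record so it cannot block the partition.
                    dead_letters.append((offset, value))
        next_offset = offset + 1
    return next_offset, dead_letters


def handler(value):
    if value == "bad":
        raise ValueError("cannot parse")


next_offset, dead = process_batch([(0, "a"), (1, "bad"), (2, "c")], handler)
print(next_offset, dead)
```

Routing the poison message to the DLQ lets the offset advance past it; without that, one bad record would stall every message behind it in the partition.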
⚡ Performance Optimization
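Producer batching: accumulate messages in a buffer (linger.ms=5, batch.size=16KB) → send when the batch fills or the linger time expires → higher throughput at the cost of slight latency. Compression: gzip/snappy/lz4 applied per batch → roughly 5× for text-like payloads, though the ratio depends on the data. Consumer fetch: fetch.min.bytes=1MB → the broker waits until 1MB is available or fetch.max.wait.ms elapses → fewer round trips. Sequential disk I/O: an append-only log on Linux can sustain on the order of 500MB/s, depending on hardware.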
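The size-or-time batching rule can be sketched with a small accumulator. BatchAccumulator is a hypothetical class, not a client library API; the defaults mirror the batch.size=16KB and linger.ms=5 values mentioned above, and the clock is injectable so the behaviour can be checked without sleeping.

```python
import time


class BatchAccumulator:
    """Buffer payloads until the batch is full or the linger time expires."""

    def __init__(self, batch_size_bytes=16_384, linger_s=0.005, clock=time.monotonic):
        self.batch_size_bytes = batch_size_bytes
        self.linger_s = linger_s
        self._clock = clock
        self._buffer = []
        self._buffered_bytes = 0
        self._first_append_at = None

    def append(self, payload):
        """Add a payload; return a batch to send if the size limit is hit, else None."""
        if not self._buffer:
            self._first_append_at = self._clock()
        self._buffer.append(payload)
        self._buffered_bytes += len(payload)
        if self._buffered_bytes >= self.batch_size_bytes:
            return self.flush()
        return None

    def poll(self):
        """Return the pending batch if it has waited at least linger_s, else None."""
        if self._buffer and self._clock() - self._first_append_at >= self.linger_s:
            return self.flush()
        return None

    def flush(self):
        """Hand over the buffered payloads and reset the accumulator."""
        batch, self._buffer = self._buffer, []
        self._buffered_bytes = 0
        self._first_append_at = None
        return batch


acc = BatchAccumulator(batch_size_bytes=10)
print(acc.append(b"12345"))   # below the size limit, nothing to send yet
print(acc.append(b"67890"))   # limit reached, the batch is returned
```

A sender thread would call poll() periodically, so a quiet producer still ships its messages after the linger interval instead of waiting for a full batch.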
Scaling considerations
- Partition count = target throughput / per-partition throughput (typically 10MB/s)
- Consumer group allows horizontal scaling of consumption (add consumers = more parallelism, up to the partition count)
- Log compaction: retain only latest value per key (for state stores)
- Kafka Connect: standard framework for DB → Kafka → DB pipelines
- Monitor consumer lag: alert if lag > threshold (consumer falling behind)
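Two of the points above, partition sizing and lag monitoring, reduce to simple arithmetic, sketched here. The helper names are hypothetical, the 1 GB/s target assumes 1M messages/sec at roughly 1 KB each (an assumption, not a stated requirement), and 10MB/s per partition is the rule of thumb from the list.

```python
import math


def partitions_needed(target_bytes_per_sec, per_partition_bytes_per_sec=10_000_000):
    """Partition count = target throughput / per-partition throughput, rounded up."""
    return math.ceil(target_bytes_per_sec / per_partition_bytes_per_sec)


def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag: newest offset in the log minus the group's committed offset."""
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in log_end_offsets.items()
    }


def lagging_partitions(lag, threshold):
    """Partitions whose lag exceeds the alert threshold."""
    return sorted(partition for partition, value in lag.items() if value > threshold)


# ASSUMPTION: 1M msg/sec x ~1 KB per message = 1 GB/s of ingress.
print(partitions_needed(1_000_000_000))

lag = consumer_lag({0: 1_500, 1: 900}, {0: 1_490, 1: 200})
print(lag, lagging_partitions(lag, threshold=100))
```

Lag that keeps growing on a partition means its consumer cannot keep up; the usual responses are adding consumers (if partitions allow) or adding partitions.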
What interviewers expect by level
- Junior: Describe producer → topic → consumer flow. Know what a partition is. Understand why async queue decouples services.
- Mid: Partition assignment, consumer groups, offset commit, replication for durability, DLQ.
- Senior: ISR, acks configuration for durability vs latency, exactly-once semantics, partition rebalancing.
- Staff: Multi-region Kafka (MirrorMaker), tiered storage (offload old messages to S3), cost model at 1M msg/sec.