📨 Design Message Queue — System Design Interview Guide

Easy · Fundamentals

Design a message queue system like Apache Kafka or Amazon SQS that enables asynchronous communication between services with guaranteed delivery, at-least-once semantics, and consumer group support.

Functional requirements

Producers publish messages to named topics
Consumers subscribe to topics and receive messages
Multiple consumer groups read the same topic independently
At-least-once delivery guarantee
Message retention for configurable period (7 days default)
Dead Letter Queue for failed messages

Non-functional requirements & scale

Throughput: 1M messages/sec per topic
Publish latency < 5ms P99
Consume latency < 10ms P99
Messages must be durable — survive broker restart
Horizontal scaling: add partitions to increase throughput
Message ordering guaranteed within a partition

Capacity estimation

A message queue decouples producers from consumers. Key concepts: topic (named channel), partition (unit of parallelism), offset (position within a partition), consumer group (set of consumers sharing the load of a topic). Brokers persist messages to disk for durability.
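A rough storage estimate follows from the stated throughput and retention. The sketch below assumes an average message size of 1 KB and a replication factor of 3; neither figure comes from the requirements above, so treat both as placeholders to adjust.

```python
def retained_bytes(msgs_per_sec: int, avg_msg_bytes: int,
                   retention_days: int = 7, replication_factor: int = 3) -> int:
    """Total bytes held on disk across all replicas at steady state."""
    retention_seconds = retention_days * 24 * 60 * 60
    return msgs_per_sec * avg_msg_bytes * retention_seconds * replication_factor


# Assumed inputs: 1M msg/sec (from the requirements), 1 KB per message (assumption).
ingest_bytes_per_sec = 1_000_000 * 1_000          # 1 GB/s of new data
total = retained_bytes(1_000_000, 1_000)          # 7 days, 3 replicas
print(f"ingest: {ingest_bytes_per_sec / 1e9:.1f} GB/s")
print(f"retained: {total / 1e15:.2f} PB")
```

Under these assumptions one topic ingests about 1 GB/s and retains roughly 1.8 PB before compression, which is why compression and tiered storage matter at this scale.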

Core entities

Topic — topicName, partitionCount, replicationFactor, retentionMs
Message — key (optional), value (bytes), timestamp, headers{}, partition, offset
ConsumerGroup — groupId, topicName, partitionOffsets{partition: offset}, members[]

API design

Internal SDK producer.send(topic, key, value) — Publish message. Key determines partition (hash(key) % partitions).
Internal SDK consumer.poll(timeout) — Fetch batch of messages. Returns records up to max.poll.records.
Internal SDK consumer.commitOffset() — Commit offset after successful processing.
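The three SDK calls can be sketched with a toy in-memory topic. This is a single-process illustration, not a real broker: the class names are hypothetical, CRC32 stands in for whatever hash the real partitioner uses, and unkeyed messages are simply sent to partition 0.

```python
import zlib
from collections import defaultdict
from typing import Optional


class InMemoryTopic:
    """Toy stand-in for a topic: one append-only list per partition."""

    def __init__(self, partitions: int):
        self.logs = [[] for _ in range(partitions)]

    def send(self, key: Optional[bytes], value: bytes):
        # Same key -> same partition, via hash(key) % partitions.
        # CRC32 is an assumption made for determinism in this sketch.
        partition = zlib.crc32(key) % len(self.logs) if key is not None else 0
        self.logs[partition].append(value)
        return partition, len(self.logs[partition]) - 1  # (partition, offset)


class Consumer:
    def __init__(self, topic: InMemoryTopic, max_poll_records: int = 500):
        self.topic = topic
        self.max_poll_records = max_poll_records
        self.position = defaultdict(int)   # next offset to fetch, per partition
        self.committed = defaultdict(int)  # next offset to resume from after a restart

    def poll(self):
        """Return up to max_poll_records tuples of (partition, offset, value)."""
        records = []
        for p, log in enumerate(self.topic.logs):
            budget = self.max_poll_records - len(records)
            if budget <= 0:
                break
            start = self.position[p]
            batch = log[start:start + budget]
            records.extend((p, start + i, v) for i, v in enumerate(batch))
            self.position[p] = start + len(batch)
        return records

    def commit_offset(self):
        # Call only after the polled records were processed successfully.
        self.committed = defaultdict(int, self.position)
```

Typical use is a loop of poll, process, commit_offset; a consumer that restarts would resume from `committed` rather than `position`.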

High-level design

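Producer sends message → partition leader broker receives → appends to partition log on disk → followers replicate → leader acks to producer (this ordering is the acks=all path; with acks=1 the leader acks before followers replicate). Consumer polls leader → receives messages → processes → commits offset.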
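The write and read path can be sketched as a leader log plus follower copies. The class below is hypothetical and simplified: every replica is treated as in-sync, and replication is triggered by hand rather than by followers fetching in the background. It illustrates one detail worth mentioning in an interview: consumers only see messages up to the high watermark, the offset that all in-sync replicas have reached.

```python
class ReplicatedPartition:
    """Toy partition: replicas[0] is the leader, the rest are followers."""

    def __init__(self, replication_factor: int = 3):
        self.replicas = [[] for _ in range(replication_factor)]

    def append(self, value) -> int:
        """Leader appends to its log and returns the new message's offset."""
        leader = self.replicas[0]
        leader.append(value)
        return len(leader) - 1

    def replicate(self, follower_index: int) -> None:
        """Follower copies whatever it is missing from the leader's log."""
        leader = self.replicas[0]
        follower = self.replicas[follower_index]
        follower.extend(leader[len(follower):])

    def high_watermark(self) -> int:
        """First offset NOT yet present on every replica."""
        return min(len(log) for log in self.replicas)

    def read(self, offset: int, max_records: int = 100) -> list:
        """Consumers read from the leader, but only fully replicated messages."""
        end = min(self.high_watermark(), offset + max_records)
        return self.replicas[0][offset:end]
```

Hiding messages above the high watermark means a consumer never reads a message that could disappear if the leader fails before followers copy it.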

Deep dives

📦 Partitions & Ordering

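A topic is split into partitions. Message key → hash → partition, so all messages with the same key land in the same partition. Messages within a partition are strictly ordered by offset; across partitions there is no global order. Within a consumer group, each partition is assigned to exactly one consumer, so partition count = max consumer parallelism per group. Add partitions to scale throughput: existing messages stay in their old partitions, and because hash(key) % partitions changes, new messages for a key may land in a different partition, which breaks per-key ordering across the change. Rebalance: a consumer joining or leaving the group triggers partition reassignment.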
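Partition assignment within a consumer group can be illustrated with a simple round-robin scheme. The function name is hypothetical and real systems offer several assignment strategies; this sketch only shows the invariant that each partition has exactly one owner per group.

```python
def assign_partitions(partition_count: int, members: list) -> dict:
    """Round-robin partitions over group members; each partition gets one owner."""
    if not members:
        return {}
    ordered = sorted(members)  # stable order so every member computes the same result
    assignment = {member: [] for member in ordered}
    for partition in range(partition_count):
        owner = ordered[partition % len(ordered)]
        assignment[owner].append(partition)
    return assignment


# Two consumers share four partitions.
print(assign_partitions(4, ["c1", "c2"]))        # {'c1': [0, 2], 'c2': [1, 3]}

# A third consumer joins: the group rebalances and partitions move.
print(assign_partitions(4, ["c1", "c2", "c3"]))  # {'c1': [0, 3], 'c2': [1], 'c3': [2]}

# More consumers than partitions: the extra consumer sits idle.
print(assign_partitions(2, ["c1", "c2", "c3"]))  # {'c1': [0], 'c2': [1], 'c3': []}
```

The last call shows why partition count caps parallelism: a consumer beyond the partition count receives nothing to read.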

💾 Durability & Replication

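Kafka writes messages to an append-only log on disk (sequential I/O, which is fast). Replication: each partition has 1 leader + N-1 followers, where N is the replication factor. ISR (In-Sync Replicas): the leader plus followers that have caught up within replica.lag.time.max.ms. Producer acks: acks=0 (fire-and-forget), acks=1 (leader ack only), acks=all (all current ISR members ack). Durable: acks=all + min.insync.replicas=2 (typically with replication factor 3), so an acknowledged write exists on at least 2 replicas. Leader failure: the controller promotes an ISR follower to leader.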
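The interaction between acks and min.insync.replicas can be modelled as a small decision function. The function and exception names here are hypothetical; the sketch only captures the rule that an acks=all write waits for every current ISR member and is rejected outright when the ISR has shrunk below min.insync.replicas.

```python
class NotEnoughReplicas(Exception):
    """Raised when an acks=all write cannot meet min.insync.replicas."""


def replicas_required_for_ack(acks: str, isr_size: int, min_insync_replicas: int) -> int:
    """How many replicas must hold the record before the producer is acked."""
    if acks == "0":
        return 0            # fire-and-forget: the producer does not wait
    if acks == "1":
        return 1            # the leader alone
    if acks == "all":
        if isr_size < min_insync_replicas:
            raise NotEnoughReplicas(
                f"ISR has {isr_size} replicas, need {min_insync_replicas}"
            )
        return isr_size     # every replica currently in the ISR
    raise ValueError(f"unknown acks setting: {acks!r}")
```

With replication factor 3 and min.insync.replicas=2, one broker can be down and acks=all writes still succeed; with two down, producers get an error instead of a weakly replicated write.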

🔄 Consumer Offset Management

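The consumer tracks its position (offset) in each partition. Commit after successful processing = at-least-once (a message may be processed twice if the consumer crashes after processing but before the commit, so handlers should be idempotent). Commit before processing = at-most-once (a crash after the commit loses the message). Exactly-once: process + commit in the same transaction (Kafka transactions); this covers Kafka-to-Kafka read-process-write flows, while side effects in external systems still need idempotency. DLQ: after N failed retries, send the message to a dead-letter topic for manual inspection and move on.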
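An at-least-once processing loop with retries and a dead-letter queue might look like the sketch below. The function name, the record shape of (offset, value), and the list used as a DLQ are all assumptions for illustration; a real consumer would publish to a dead-letter topic and usually back off between retries.

```python
def consume_batch(records, handler, dlq, max_retries: int = 3):
    """Process records in order and return the offset to commit afterwards.

    records: iterable of (offset, value) from one partition.
    Returns the next offset to read, or None if the batch was empty.
    """
    next_offset = None
    for offset, value in records:
        for attempt in range(1, max_retries + 1):
            try:
                handler(value)
                break  # processed successfully, stop retrying
            except Exception as exc:
                if attempt == max_retries:
                    # Give up on this message so it does not block the partition.
                    dlq.append((offset, value, repr(exc)))
        next_offset = offset + 1
    return next_offset
```

The caller commits the returned offset only after this function returns. A crash before that commit causes the whole batch to be delivered again, which is the at-least-once behaviour described above.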

⚡ Performance Optimization

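Producer batching: accumulate messages in a buffer (e.g. linger.ms=5, batch.size=16384 bytes) → send as one batch → higher throughput at the cost of a few ms of latency. Compression: gzip/snappy/lz4 applied per batch; ratios depend on the payload, with around 5× plausible for repetitive text such as JSON. Consumer fetch: fetch.min.bytes=1MB → broker waits until 1MB is available or fetch.max.wait.ms elapses → fewer round trips, at the cost of consume latency. Sequential disk I/O: an append-only log can sustain hundreds of MB/s (e.g. ~500MB/s), depending on hardware.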
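The batching trade-off can be sketched as a buffer that flushes when it is full or when the oldest message has waited long enough. The class is hypothetical and simplified: a real producer keeps one buffer per partition and flushes from a background thread, whereas here the linger check runs only when send or maybe_flush is called. The clock is injectable so the behaviour can be tested without sleeping.

```python
import time


class BatchingProducer:
    """Buffers messages and hands them to send_batch as a list."""

    def __init__(self, send_batch, batch_size: int = 16_384,
                 linger_ms: float = 5, clock=time.monotonic):
        self.send_batch = send_batch    # callable that ships one batch
        self.batch_size = batch_size    # flush once this many bytes are buffered
        self.linger_ms = linger_ms      # or once the oldest message waited this long
        self.clock = clock              # returns seconds
        self.buffer = []
        self.buffered_bytes = 0
        self.first_at = 0.0

    def send(self, value: bytes) -> None:
        if not self.buffer:
            self.first_at = self.clock()  # linger timer starts with the first message
        self.buffer.append(value)
        self.buffered_bytes += len(value)
        self.maybe_flush()

    def maybe_flush(self) -> None:
        if not self.buffer:
            return
        full = self.buffered_bytes >= self.batch_size
        lingered = (self.clock() - self.first_at) * 1000 >= self.linger_ms
        if full or lingered:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.send_batch(self.buffer)
            self.buffer = []
            self.buffered_bytes = 0
```

Raising linger_ms or batch_size yields larger batches and fewer requests, but each message may wait longer before it leaves the producer, which eats into the publish latency budget.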

Scaling considerations

Partition count = target throughput / per-partition throughput (typically 10MB/s)
Consumer groups scale consumption horizontally (add consumers = more parallelism, up to the partition count)
Log compaction: retain only latest value per key (for state stores)
Kafka Connect: standard framework for DB → Kafka → DB pipelines
Monitor consumer lag: alert if lag > threshold (consumer falling behind)
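The partition-count formula and the lag check above can be worked through with small helpers. Both function names are hypothetical, the 10MB/s per-partition figure is the rule of thumb quoted in the list, and the 1 KB average message size is an assumption not stated in the requirements.

```python
import math


def partitions_needed(msgs_per_sec: int, avg_msg_bytes: int,
                      per_partition_bytes_per_sec: int = 10_000_000) -> int:
    """Partition count = target throughput / per-partition throughput, rounded up."""
    target = msgs_per_sec * avg_msg_bytes
    return math.ceil(target / per_partition_bytes_per_sec)


def consumer_lag(log_end_offsets: dict, committed_offsets: dict) -> dict:
    """Per-partition lag: messages written but not yet committed by the group."""
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in log_end_offsets.items()
    }


# 1M msg/sec at an assumed 1 KB each is 1 GB/s, so about 100 partitions at 10MB/s.
print(partitions_needed(1_000_000, 1_000))   # 100

lag = consumer_lag({0: 100, 1: 50}, {0: 90})
print(lag)                                   # {0: 10, 1: 50}
threshold = 25                               # illustrative alert threshold
print([p for p, n in lag.items() if n > threshold])  # [1]
```

A partition missing from the committed offsets is treated as fully unread here, so a newly added partition shows up as lag until some consumer commits an offset for it.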

What interviewers expect by level

Junior: Describe producer → topic → consumer flow. Know what a partition is. Understand why async queue decouples services.
Mid: Partition assignment, consumer groups, offset commit, replication for durability, DLQ.
Senior: ISR, acks configuration for durability vs latency, exactly-once semantics, partition rebalancing.
Staff: Multi-region Kafka (MirrorMaker), tiered storage (offload old messages to S3), cost model at 1M msg/sec.
