📨 Design Message Queue — System Design Interview Guide

Easy · Fundamentals

Design a message queue system like Apache Kafka or Amazon SQS that enables asynchronous communication between services with guaranteed delivery, at-least-once semantics, and consumer group support.


Functional requirements

Non-functional requirements & scale

Capacity estimation

A message queue decouples producers from consumers. Key concepts: topic (a named channel), partition (the unit of parallelism and ordering), offset (a message's position within a partition), consumer group (a set of consumers that share a topic's partitions). Brokers persist messages to disk for durability.
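To make the vocabulary concrete, here is a minimal in-memory sketch of a topic as a list of append-only partition logs, where the list index doubles as the offset. The names `PartitionLog`, `Topic`, and `orders` are hypothetical, and a real broker would persist the log to disk rather than hold it in memory.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionLog:
    """Append-only log for one partition; the list index doubles as the offset."""
    records: list[bytes] = field(default_factory=list)

    def append(self, value: bytes) -> int:
        """Store a message and return the offset it was assigned."""
        self.records.append(value)
        return len(self.records) - 1

    def read(self, offset: int, max_records: int = 10) -> list[tuple[int, bytes]]:
        """Return up to max_records (offset, value) pairs starting at offset."""
        chunk = self.records[offset:offset + max_records]
        return list(enumerate(chunk, start=offset))


@dataclass
class Topic:
    """A named channel made of independent partition logs."""
    name: str
    partitions: list[PartitionLog]


# Hypothetical topic with three partitions.
orders = Topic("orders", [PartitionLog() for _ in range(3)])
first = orders.partitions[0].append(b"order-created")
second = orders.partitions[0].append(b"order-paid")
```

Offsets are per partition: each log starts at 0 and grows independently of the others.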

Core entities

API design

High-level design

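Write path: producer sends a message → leader broker receives it → appends it to the partition log on disk → followers replicate it → leader acks the producer (with acks=all, only after the in-sync replicas have the record). Read path: consumer polls the leader → receives messages → processes them → commits the offset.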
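The write and read paths can be sketched end to end in a few lines. This is a toy model under stated assumptions: the `Partition` class, its method names, and the group name `email-service` are hypothetical, replication is a synchronous in-memory copy, and nothing touches disk or the network.

```python
class Partition:
    """One partition: a leader log plus follower copies (all in memory here)."""

    def __init__(self, follower_count: int = 2) -> None:
        self.leader: list[str] = []
        self.followers: list[list[str]] = [[] for _ in range(follower_count)]
        self.committed: dict[str, int] = {}  # consumer group -> next offset to read

    def produce(self, message: str) -> int:
        """Append on the leader, copy to followers, then ack with the offset."""
        self.leader.append(message)
        for follower in self.followers:
            follower.append(message)  # synchronous copy stands in for replication
        return len(self.leader) - 1

    def poll(self, group: str, max_records: int = 100) -> list[tuple[int, str]]:
        """Return (offset, message) pairs starting at the group's committed offset."""
        start = self.committed.get(group, 0)
        return list(enumerate(self.leader[start:start + max_records], start=start))

    def commit(self, group: str, next_offset: int) -> None:
        """Record the next offset the group should read."""
        self.committed[group] = next_offset


partition = Partition()
partition.produce("signup:alice")
partition.produce("signup:bob")

handled = []
for offset, message in partition.poll("email-service"):
    handled.append(message.upper())                # process
    partition.commit("email-service", offset + 1)  # then commit
```

Because committed offsets are tracked per group, a second group polling the same partition starts from offset 0 and sees every message.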

Deep dives

📦 Partitions & Ordering

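Topic = multiple partitions. Message key → hash → partition, so all messages with the same key land in the same partition. Messages within a partition are strictly ordered by offset; across partitions there is no global order. Within a consumer group, each partition is assigned to exactly one consumer, so partition count = max consumer parallelism (extra consumers sit idle). Add partitions to scale throughput: existing messages stay in their old partitions, but the key → partition mapping changes for new messages, so per-key ordering can break across the change. Rebalance: a consumer joining or leaving the group triggers partition reassignment.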
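A short sketch of both ideas, key-based partitioning and consumer-group assignment. It assumes CRC32 as the stable hash (Kafka's default partitioner uses murmur2) and a simple round-robin assignor; `partition_for` and `assign` are hypothetical helper names.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (CRC32 here, not Kafka's murmur2)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign(partitions: int, consumers: list[str]) -> dict[str, list[int]]:
    """Round-robin partitions over consumers; each partition gets exactly one owner."""
    assignment: dict[str, list[int]] = {c: [] for c in consumers}
    for p in range(partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment


# The same key always maps to the same partition while the count is fixed.
same = partition_for("user-42", 6) == partition_for("user-42", 6)

before = assign(4, ["c1", "c2"])        # two consumers share four partitions
after = assign(4, ["c1", "c2", "c3"])   # a third consumer joins: rebalance
idle = assign(2, ["c1", "c2", "c3"])    # more consumers than partitions
```

In the last case `c3` receives no partitions, which is why the partition count caps consumer parallelism within a group.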

💾 Durability & Replication

Kafka writes messages to disk (sequential I/O, which is fast). Replication: with replication factor N, each partition has 1 leader + N-1 followers. ISR (In-Sync Replicas): followers that have caught up with the leader within replica.lag.time.max.ms. Producer acks: acks=0 (fire-and-forget), acks=1 (leader ack), acks=all (ack once every ISR member has the record). Durable: acks=all + min.insync.replicas=2, so a write is rejected if fewer than 2 replicas are in sync. Leader failure: the controller promotes an ISR follower.
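The ack rules can be written down as a small decision function. This is a simplified model, not broker code: `can_ack` and `elect_leader` are hypothetical names, the ISR count includes the leader, and leader election is reduced to picking the first surviving ISR member.

```python
from typing import Optional


def can_ack(acks: str, isr_size: int, acked_replicas: int, min_insync: int = 2) -> bool:
    """Decide whether a write may be acknowledged to the producer.

    isr_size counts the leader; acked_replicas counts ISR members
    (leader included) that already hold the record.
    """
    if acks == "0":
        return True  # producer does not wait for any broker response
    if acks == "1":
        return acked_replicas >= 1  # the leader has written the record
    if acks == "all":
        if isr_size < min_insync:
            return False  # too few in-sync replicas: the write is rejected
        return acked_replicas >= isr_size  # every ISR member has the record
    raise ValueError(f"unknown acks setting: {acks}")


def elect_leader(isr: list[int], failed: int) -> Optional[int]:
    """Pick the first surviving ISR member as the new leader (None if none remain)."""
    survivors = [broker for broker in isr if broker != failed]
    return survivors[0] if survivors else None
```

With acks=all and min.insync.replicas=2, `can_ack("all", 1, 1)` is False: a lone leader cannot accept writes, which trades availability for durability.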

🔄 Consumer Offset Management

Consumer tracks its position (offset) in each partition. Commit the offset after successful processing = at-least-once (a message may be processed twice if the consumer crashes after processing but before committing). Commit before processing = at-most-once (a crash after the commit loses the message). Exactly-once: process + commit in the same transaction (Kafka transactions); side effects outside Kafka still need idempotent handling. DLQ: after N failed retries, send the message to a dead-letter topic for manual inspection.
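The duplicate-on-crash behavior of at-least-once delivery, and the retry-then-DLQ pattern, can be simulated without a broker. Everything here is a hypothetical sketch: `run_consumer` models a crash by returning early without committing, and `handle_with_dlq` uses a plain list as the dead-letter topic.

```python
def run_consumer(log, committed, processed, crash_before_commit_at=None):
    """Process from `committed`, committing after each message (at-least-once).

    Returns the last committed offset. A simulated crash returns before the
    commit, so the message at that offset is handled again on restart.
    """
    for offset in range(committed, len(log)):
        processed.append(log[offset])          # process first
        if offset == crash_before_commit_at:
            return committed                   # crash: work done, offset not committed
        committed = offset + 1                 # commit after processing
    return committed


def handle_with_dlq(message, handler, dead_letters, max_retries=3):
    """Try handler up to max_retries times; on final failure park the message."""
    for _ in range(max_retries):
        try:
            handler(message)
            return True
        except Exception:
            continue
    dead_letters.append(message)
    return False


log = ["a", "b", "c"]
processed = []
committed = run_consumer(log, 0, processed, crash_before_commit_at=1)
committed = run_consumer(log, committed, processed)  # restart after the crash


def picky(message):
    """Hypothetical handler that always fails on one poison message."""
    if message == "bad":
        raise ValueError("cannot parse")


dead_letters = []
results = [handle_with_dlq(m, picky, dead_letters) for m in ["ok", "bad", "fine"]]
```

After the restart, `processed` contains "b" twice, which is why at-least-once consumers usually need idempotent handlers.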

⚡ Performance Optimization

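Producer batching: accumulate messages in a buffer (e.g. linger.ms=5, batch.size=16384 bytes) → send when the batch fills or the linger time expires → higher throughput at the cost of slight latency. Compression: gzip/snappy/lz4/zstd per batch → roughly 5× for repetitive text such as JSON, though the ratio depends on the payload. Consumer fetch: fetch.min.bytes=1MB → broker waits until 1MB is available or fetch.max.wait.ms elapses → fewer round trips. Sequential disk I/O: an append-only log on Linux can sustain hundreds of MB/s (on the order of 500MB/s), depending on hardware.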
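A rough sketch of size-or-linger batching with per-batch compression. The `Batcher` class is hypothetical, time is passed in explicitly to keep the example deterministic, and zlib stands in for the codecs above. A real producer also flushes on a background timer instead of only when the next record arrives.

```python
import zlib
from typing import Optional


class Batcher:
    """Buffers records and emits one compressed batch per flush."""

    def __init__(self, batch_size: int = 16_384, linger_ms: int = 5) -> None:
        self.batch_size = batch_size
        self.linger_ms = linger_ms
        self.buffer: list[bytes] = []
        self.buffered_bytes = 0
        self.first_ms = 0

    def add(self, record: bytes, now_ms: int) -> Optional[bytes]:
        """Buffer a record; return a compressed batch once a limit is reached."""
        if not self.buffer:
            self.first_ms = now_ms  # linger clock starts with the first record
        self.buffer.append(record)
        self.buffered_bytes += len(record)
        full = self.buffered_bytes >= self.batch_size
        lingered = now_ms - self.first_ms >= self.linger_ms
        return self.flush() if full or lingered else None

    def flush(self) -> Optional[bytes]:
        """Compress and return whatever is buffered, or None if empty."""
        if not self.buffer:
            return None
        payload = b"\n".join(self.buffer)
        self.buffer, self.buffered_bytes = [], 0
        return zlib.compress(payload)


records = [b'{"event":"click","user":%d}' % i for i in range(100)]
batcher = Batcher(batch_size=1024, linger_ms=5)
batches = []
for record in records:
    out = batcher.add(record, now_ms=0)  # same timestamp: only size triggers a flush
    if out is not None:
        batches.append(out)
tail = batcher.flush()  # drain the partial final batch
```

Compressing a whole batch rather than each record lets the codec exploit repetition across records, which is where most of the savings on small JSON messages come from.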

Scaling considerations

What interviewers expect by level

