🗄️ Design Distributed Key-Value Store — System Design Interview Guide
Hard · Distributed Systems
Design a distributed key-value store like Amazon DynamoDB or Apache Cassandra that provides high availability, horizontal scalability, and tunable consistency.
Open the interactive Distributed Key-Value Store design on PrepGrind → Drag load balancers, caches, databases, and queues onto a canvas, run a live traffic simulation to watch latency and bottlenecks under load, and follow the full interview walkthrough below — free, in your browser.
Functional requirements
- GET(key) → value; PUT(key, value); DELETE(key)
- Horizontal scaling: add nodes to increase capacity
- Replication: data stored on multiple nodes for durability
- Tunable consistency: choose between eventual and strong
- Automatic failure detection and data re-replication
Non-functional requirements & scale
- 1M reads/sec and 100K writes/sec
- Read latency < 1ms, write latency < 5ms
- Data replicated on 3 nodes (N=3); quorum read/write (R=2, W=2)
- 99.99% availability (AP system by default)
- Linear horizontal scalability: double nodes = double throughput
- No single point of failure
Capacity estimation
Key design decisions: consistent hashing for node assignment, vector clocks for conflict detection, gossip protocol for failure detection, quorum for consistency. CAP: choose AP (like DynamoDB/Cassandra) — always accept writes, resolve conflicts on read.
Core entities
- KeyValue — key (string), value (bytes), version (vector clock), ttl?, metadata
- VirtualNode — vnodeId, physicalNode, tokenRange (start, end on hash ring)
- NodeMetadata — nodeId, host, port, status, tokens[], lastHeartbeat
API design
GET GET /keys/:key?consistency=eventual|strong— Returns value and version. Eventual: R=1. Strong: R=quorum.PUT PUT /keys/:key— Write value. Body: { value, ttl? }. Returns { version }.DELETE DELETE /keys/:key— Delete (tombstone marker — not immediate).
High-level design
Consistent hash ring with virtual nodes. Client SDK routes request to correct node. Coordinator node replicates to N-1 additional nodes. Gossip protocol for cluster membership. Hinted handoff for temporary node failures.
Deep dives
🔵 Consistent Hashing
Hash ring 0 to 2^128. Each node assigned multiple positions (virtual nodes = 150 per physical node). Key → hash → clockwise to nearest vnode = owner. Virtual nodes ensure uniform distribution even with heterogeneous nodes. Node addition/removal: only affects adjacent vnodes → ~5% data movement. Replication: coordinator + next N-1 nodes clockwise.
📦 Quorum Reads/Writes
N=3 replicas. W=2 (write quorum). R=2 (read quorum). W+R > N ensures at least 1 overlapping node has latest value. Read repair: during quorum read, if one replica returns older version → async update. Sloppy quorum: if a replica is down, write to any available node (hinted handoff) → sync when node recovers. Tunable: W=1 (fast writes) vs W=3 (strong durability).
🔀 Conflict Resolution
Concurrent writes: vector clock tracks causality. Vector clock = {nodeId: counter} per key. On write: increment clock. On read with conflicting clocks: user resolves (DynamoDB "last writer wins" by default) or application handles (shopping cart = merge). Anti-entropy: Merkle tree per node compares key ranges — detect missing/inconsistent data between replicas.
💡 Failure Detection
Gossip protocol: each node periodically (every 1s) sends heartbeat list to 3 random nodes. Each node maintains {nodeId: lastSeen}. If lastSeen > threshold (5s): mark SUSPECT; > 10s: DEAD. Dead node triggers re-replication of its key range to maintain N replicas. Phi accrual failure detector: adaptive threshold based on network conditions.
Scaling considerations
- Virtual nodes (150 per physical node) for balanced data distribution
- LSM-tree storage engine (LevelDB/RocksDB) for write-optimized persistence
- Bloom filter per SSTable: O(1) check if key exists before disk read
- Compaction: merge SSTables periodically to reclaim space from tombstones
- Read path: check MemTable → L0 SSTables → L1 SSTables (bloom filter first)
What interviewers expect by level
- Junior: Describe GET/PUT operations, replication concept. Understand why a single DB node cannot handle 1M reads/sec.
- Mid: Consistent hashing, N/R/W quorum configuration, vector clocks for versioning.
- Senior: Full Dynamo-style design: virtual nodes, sloppy quorum, hinted handoff, anti-entropy with Merkle trees.
- Staff: LSM-tree internals, compaction strategies, cross-datacenter replication, migration from SQL to distributed KV.
Practice more system design case studies
- Design URL Shortener
- Design Social Media Feed
- Design Chat System
- Design Video Streaming
- Design Ride-Sharing Platform
- Design E-Commerce Platform
- Design UPI Payment Gateway
- Design Google Docs
- Design Tinder
- Design Google Drive / Dropbox
- Design Instagram
- Design Type-Ahead Search
- Design Web Crawler
- Design Ticket Booking (BookMyShow)
- Design Pastebin
- Design Notification System
- Design Rate Limiter (Standalone)
- Design Simple Web App
- Design Food Delivery (Swiggy)
- Design Stock Trading System
- Design Live Streaming (Twitch)
- Design Ad Click Aggregation
- Design Monitoring / Metrics (Datadog)
- Design Online Judge (LeetCode)
- Design FB Post Search
- Design Yelp
- Design Cache Layer
- Design Message Queue
- Design Full Production Stack
PrepGrind runs entirely in your browser, free, no installation required. Loading the interactive playground…