🗄️ Design Distributed Key-Value Store — System Design Interview Guide

Hard · Distributed Systems

Design a distributed key-value store like Amazon DynamoDB or Apache Cassandra that provides high availability, horizontal scalability, and tunable consistency.

Open the interactive Distributed Key-Value Store design on PrepGrind → Drag load balancers, caches, databases, and queues onto a canvas, run a live traffic simulation to watch latency and bottlenecks under load, and follow the full interview walkthrough below — free, in your browser.

Functional requirements

GET(key) → value; PUT(key, value); DELETE(key)
Horizontal scaling: add nodes to increase capacity
Replication: data stored on multiple nodes for durability
Tunable consistency: choose between eventual and strong
Automatic failure detection and data re-replication

Non-functional requirements & scale

1M reads/sec and 100K writes/sec
Read latency < 1ms, write latency < 5ms
Data replicated on 3 nodes (N=3); quorum read/write (R=2, W=2)
99.99% availability (AP system by default)
Linear horizontal scalability: double nodes = double throughput
No single point of failure

Capacity estimation

Key design decisions: consistent hashing for node assignment, vector clocks for conflict detection, gossip protocol for failure detection, quorum for consistency. CAP: choose AP (like DynamoDB/Cassandra) — always accept writes, resolve conflicts on read.

Core entities

KeyValue — key (string), value (bytes), version (vector clock), ttl?, metadata
VirtualNode — vnodeId, physicalNode, tokenRange (start, end on hash ring)
NodeMetadata — nodeId, host, port, status, tokens[], lastHeartbeat

API design

GET GET /keys/:key?consistency=eventual|strong — Returns value and version. Eventual: R=1. Strong: R=quorum.
PUT PUT /keys/:key — Write value. Body: { value, ttl? }. Returns { version }.
DELETE DELETE /keys/:key — Delete (tombstone marker — not immediate).

High-level design

Consistent hash ring with virtual nodes. Client SDK routes request to correct node. Coordinator node replicates to N-1 additional nodes. Gossip protocol for cluster membership. Hinted handoff for temporary node failures.

Deep dives

🔵 Consistent Hashing

Hash ring 0 to 2^128. Each node assigned multiple positions (virtual nodes = 150 per physical node). Key → hash → clockwise to nearest vnode = owner. Virtual nodes ensure uniform distribution even with heterogeneous nodes. Node addition/removal: only affects adjacent vnodes → ~5% data movement. Replication: coordinator + next N-1 nodes clockwise.

📦 Quorum Reads/Writes

N=3 replicas. W=2 (write quorum). R=2 (read quorum). W+R > N ensures at least 1 overlapping node has latest value. Read repair: during quorum read, if one replica returns older version → async update. Sloppy quorum: if a replica is down, write to any available node (hinted handoff) → sync when node recovers. Tunable: W=1 (fast writes) vs W=3 (strong durability).

🔀 Conflict Resolution

Concurrent writes: vector clock tracks causality. Vector clock = {nodeId: counter} per key. On write: increment clock. On read with conflicting clocks: user resolves (DynamoDB "last writer wins" by default) or application handles (shopping cart = merge). Anti-entropy: Merkle tree per node compares key ranges — detect missing/inconsistent data between replicas.

💡 Failure Detection

Gossip protocol: each node periodically (every 1s) sends heartbeat list to 3 random nodes. Each node maintains {nodeId: lastSeen}. If lastSeen > threshold (5s): mark SUSPECT; > 10s: DEAD. Dead node triggers re-replication of its key range to maintain N replicas. Phi accrual failure detector: adaptive threshold based on network conditions.

Scaling considerations

Virtual nodes (150 per physical node) for balanced data distribution
LSM-tree storage engine (LevelDB/RocksDB) for write-optimized persistence
Bloom filter per SSTable: O(1) check if key exists before disk read
Compaction: merge SSTables periodically to reclaim space from tombstones
Read path: check MemTable → L0 SSTables → L1 SSTables (bloom filter first)

What interviewers expect by level

Junior: Describe GET/PUT operations, replication concept. Understand why a single DB node cannot handle 1M reads/sec.
Mid: Consistent hashing, N/R/W quorum configuration, vector clocks for versioning.
Senior: Full Dynamo-style design: virtual nodes, sloppy quorum, hinted handoff, anti-entropy with Merkle trees.
Staff: LSM-tree internals, compaction strategies, cross-datacenter replication, migration from SQL to distributed KV.

Practice more system design case studies

PrepGrind runs entirely in your browser, free, no installation required. Loading the interactive playground…