🗄️ Design Distributed Key-Value Store — System Design Interview Guide

Hard · Distributed Systems

Design a distributed key-value store like Amazon DynamoDB or Apache Cassandra that provides high availability, horizontal scalability, and tunable consistency.

Open the interactive Distributed Key-Value Store design on PrepGrind → Drag load balancers, caches, databases, and queues onto a canvas, run a live traffic simulation to watch latency and bottlenecks under load, and follow the full interview walkthrough below — free, in your browser.

Functional requirements

Non-functional requirements & scale

Capacity estimation

Key design decisions: consistent hashing for node assignment, vector clocks for conflict detection, gossip protocol for failure detection, quorum for consistency. CAP: choose AP (like DynamoDB/Cassandra) — always accept writes, resolve conflicts on read.

Core entities

API design

High-level design

Consistent hash ring with virtual nodes. Client SDK routes request to correct node. Coordinator node replicates to N-1 additional nodes. Gossip protocol for cluster membership. Hinted handoff for temporary node failures.

Deep dives

🔵 Consistent Hashing

Hash ring 0 to 2^128. Each node assigned multiple positions (virtual nodes = 150 per physical node). Key → hash → clockwise to nearest vnode = owner. Virtual nodes ensure uniform distribution even with heterogeneous nodes. Node addition/removal: only affects adjacent vnodes → ~5% data movement. Replication: coordinator + next N-1 nodes clockwise.

📦 Quorum Reads/Writes

N=3 replicas. W=2 (write quorum). R=2 (read quorum). W+R > N ensures at least 1 overlapping node has latest value. Read repair: during quorum read, if one replica returns older version → async update. Sloppy quorum: if a replica is down, write to any available node (hinted handoff) → sync when node recovers. Tunable: W=1 (fast writes) vs W=3 (strong durability).

🔀 Conflict Resolution

Concurrent writes: vector clock tracks causality. Vector clock = {nodeId: counter} per key. On write: increment clock. On read with conflicting clocks: user resolves (DynamoDB "last writer wins" by default) or application handles (shopping cart = merge). Anti-entropy: Merkle tree per node compares key ranges — detect missing/inconsistent data between replicas.

💡 Failure Detection

Gossip protocol: each node periodically (every 1s) sends heartbeat list to 3 random nodes. Each node maintains {nodeId: lastSeen}. If lastSeen > threshold (5s): mark SUSPECT; > 10s: DEAD. Dead node triggers re-replication of its key range to maintain N replicas. Phi accrual failure detector: adaptive threshold based on network conditions.

Scaling considerations

What interviewers expect by level

Practice more system design case studies

PrepGrind runs entirely in your browser, free, no installation required. Loading the interactive playground…