🎯 Design Recommendation System — System Design Interview Guide

Hard · AI & ML Systems

Design a large-scale recommendation system (feed/products/video) that selects, from millions of items, the handful a user is most likely to engage with — in tens of milliseconds.


Functional requirements

Non-functional requirements & scale

Capacity estimation

Scoring all 10M items with a heavy model on every request cannot fit in a 100 ms budget, so recommendation is a funnel: cheap candidate generation narrows millions of items to a few hundred, then an expensive ML ranker scores only those. Embeddings and approximate nearest neighbor (ANN) search power retrieval; a feature store feeds the ranker. The hard problems are latency, train/serve feature consistency, and avoiding feedback loops.
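The arithmetic behind the funnel can be sketched as a back-of-envelope calculation. The per-item cost below (10 µs for the heavy ranker, scored sequentially) is an illustrative assumption, not a measurement; real numbers depend on the model, batching, and hardware.

```python
def scoring_time_ms(num_items: int, micros_per_item: float) -> float:
    """Sequential time to score num_items, in milliseconds."""
    return num_items * micros_per_item / 1000.0


# Assumed cost, for illustration only: the heavy ranker takes 10 µs per item.
RANKER_MICROS_PER_ITEM = 10.0

full_catalog_ms = scoring_time_ms(10_000_000, RANKER_MICROS_PER_ITEM)
funnel_ms = scoring_time_ms(500, RANKER_MICROS_PER_ITEM)

print(f"rank all 10M items:  {full_catalog_ms:,.0f} ms")  # 100,000 ms
print(f"rank 500 candidates: {funnel_ms:.1f} ms")          # 5.0 ms
```

Under this assumption, ranking the full catalog takes about 100 seconds, while ranking 500 candidates takes about 5 ms, which leaves most of a 100 ms budget for retrieval, feature fetches, and network hops.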

Core entities

API design

High-level design

On request, the Rec Service fetches user features from the feature store, runs candidate generation (ANN over item embeddings + popular/recent sources), then scores the ~hundreds of candidates with a ranking model, applies business rules (diversity/dedup/freshness), and returns the top-N. Interaction events stream into a real-time feature pipeline and an offline training pipeline that periodically ships new embeddings and ranker models.
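The request path described above can be sketched as one function with its dependencies injected. This is a minimal sketch, not a reference implementation: `recommend`, its parameter names, and the per-category cap used as the diversity rule are all hypothetical choices made for illustration.

```python
from typing import Callable, Dict, List, Sequence

Features = Dict[str, float]


def recommend(
    user_id: str,
    fetch_user_features: Callable[[str], Features],
    candidate_sources: Sequence[Callable[[str, Features], List[str]]],
    score: Callable[[Features, str], float],
    item_category: Callable[[str], str],
    top_n: int = 10,
    max_per_category: int = 2,
) -> List[str]:
    # 1. Feature store lookup for the requesting user.
    features = fetch_user_features(user_id)

    # 2. Candidate generation: union of all sources, deduplicated in order.
    candidates: List[str] = []
    seen = set()
    for source in candidate_sources:
        for item in source(user_id, features):
            if item not in seen:
                seen.add(item)
                candidates.append(item)

    # 3. Ranking: score every candidate, highest score first.
    ranked = sorted(candidates, key=lambda item: score(features, item), reverse=True)

    # 4. Business rules: a per-category cap stands in for a diversity rule.
    result: List[str] = []
    per_category: Dict[str, int] = {}
    for item in ranked:
        category = item_category(item)
        if per_category.get(category, 0) >= max_per_category:
            continue
        per_category[category] = per_category.get(category, 0) + 1
        result.append(item)
        if len(result) == top_n:
            break
    return result


# Toy usage with in-memory stand-ins for the feature store, sources, and ranker.
toy_scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.6}
toy_categories = {"a": "news", "b": "news", "c": "news", "d": "sports"}
feed = recommend(
    "u1",
    fetch_user_features=lambda user: {},
    candidate_sources=[lambda u, f: ["a", "b"], lambda u, f: ["b", "c", "d"]],
    score=lambda f, item: toy_scores[item],
    item_category=lambda item: toy_categories[item],
    top_n=3,
)
print(feed)  # ['a', 'b', 'd']  ("c" is dropped by the category cap)
```

Passing the stages in as callables keeps the orchestration testable without a real feature store or model server.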

Deep dives

🪜 Two-Stage Funnel

Stage 1 (candidate generation / retrieval): cheaply reduce 10M items → ~500 using multiple sources: ANN over user and item embeddings (two-tower model), trending, recently-viewed, follow graph. Optimize for recall, not precision, because a relevant item dropped here can never be recovered by the ranker. Stage 2 (ranking): a heavier model scores those ~500 with rich features (user, item, context, cross features) to predict click/watch probability. This split is what makes sub-100ms recommendations over millions of items possible.
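One way to blend Stage 1 sources is to interleave them round-robin up to a fixed budget, so no single source crowds out the others. The sketch below assumes each source already returns its items best-first; `merge_candidates`, `recall`, and the source names are hypothetical.

```python
from itertools import zip_longest
from typing import Dict, Iterable, List, Set


def merge_candidates(sources: Dict[str, List[str]], budget: int) -> List[str]:
    """Round-robin across sources, dropping duplicates, until budget is reached."""
    if budget <= 0:
        return []
    merged: List[str] = []
    seen: Set[str] = set()
    # zip_longest yields the 1st item of every source, then the 2nd, and so on,
    # padding exhausted sources with None.
    for tier in zip_longest(*sources.values()):
        for item in tier:
            if item is None or item in seen:
                continue
            seen.add(item)
            merged.append(item)
            if len(merged) == budget:
                return merged
    return merged


def recall(candidates: Iterable[str], relevant: Set[str]) -> float:
    """Fraction of relevant items that survived candidate generation."""
    if not relevant:
        return 0.0
    return len(set(candidates) & relevant) / len(relevant)


sources = {
    "ann": ["v1", "v2", "v3"],
    "trending": ["v9", "v1"],
    "follow_graph": ["v7"],
}
candidates = merge_candidates(sources, budget=4)
print(candidates)                        # ['v1', 'v9', 'v7', 'v2']
print(recall(candidates, {"v7", "v3"}))  # 0.5
```

Tracking recall of the merged set against held-out engagements is a direct way to check that Stage 1 is not starving the ranker.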

🗄️ Feature Store & Train/Serve Skew

The #1 silent killer of rec quality is computing a feature one way in training (batch) and another at serving (online). A feature store provides the same definitions to both: an offline store for building training datasets and a low-latency online store (e.g., Redis) for serving. Point-in-time-correct joins attach to each training example only the feature values that were available at the time of the event, which prevents leakage of future information; freshness of real-time features (last-5-min clicks) drives responsiveness.
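A point-in-time-correct join can be sketched in a few lines: for each labeled event, take the latest feature value written at or before the event's timestamp. The function names and the integer timestamps are hypothetical, and a real feature store would do this at scale in a batch engine rather than in a Python loop.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple


def feature_as_of(history: List[Tuple[int, float]], event_ts: int) -> Optional[float]:
    """Latest feature value with timestamp <= event_ts.

    history must be sorted by timestamp. Returns None if the feature had not
    been written yet at event_ts.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, event_ts)
    return history[idx - 1][1] if idx > 0 else None


def build_training_rows(
    labels: List[Tuple[int, int]], history: List[Tuple[int, float]]
) -> List[Tuple[int, Optional[float], int]]:
    """Join each (event_ts, label) with the feature value known at that time."""
    return [(ts, feature_as_of(history, ts), label) for ts, label in labels]


# Feature "clicks in the last 5 minutes", written at t=100, 200, 300.
clicks_history = [(100, 0.0), (200, 3.0), (300, 7.0)]
labels = [(150, 1), (250, 0), (50, 1)]

rows = build_training_rows(labels, clicks_history)
print(rows)  # [(150, 0.0, 1), (250, 3.0, 0), (50, None, 1)]
```

A naive join on the latest value would attach 7.0 to every row, so the model would train on information it could never see at serving time.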

🔀 Two-Tower Retrieval

Train a user tower and an item tower to embed both into one space so relevance ≈ dot product. Precompute all item embeddings into an ANN index offline; at request time embed the user once and do an ANN lookup. This scales retrieval to millions of items in milliseconds and is the workhorse of modern candidate generation.
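The serving side of two-tower retrieval reduces to "embed the user, then find the items with the largest dot product". The sketch below uses an exact brute-force scan as a stand-in for the ANN index, which is only practical for small catalogs; the embeddings are made-up two-dimensional vectors and `retrieve_top_k` is a hypothetical name.

```python
import heapq
from typing import Dict, List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def retrieve_top_k(
    user_embedding: Sequence[float],
    item_embeddings: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Exact top-k by dot product; an ANN index would approximate this scan."""
    scored = (
        (item_id, dot(user_embedding, vec)) for item_id, vec in item_embeddings.items()
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Item embeddings are precomputed offline by the item tower (toy values here).
item_embeddings = {
    "cooking": [0.9, 0.1],
    "soccer": [0.1, 0.9],
    "baking": [0.8, 0.3],
}
# The user tower runs once per request (toy value here).
user_embedding = [1.0, 0.0]

top = retrieve_top_k(user_embedding, item_embeddings, k=2)
print([item_id for item_id, _ in top])  # ['cooking', 'baking']
```

Because the item side is precomputed, the per-request cost is one user-tower forward pass plus the index lookup.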

🎲 Exploration vs Exploitation

Always serving the top predicted items creates feedback loops: the model never learns about items it never shows. Inject exploration (epsilon-greedy, Thompson sampling, or another bandit strategy) and log the propensity, i.e. the probability with which each shown item was chosen, so you can debias training. Add diversity/dedup rules and freshness boosts so the feed is not repetitive or stale.
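Epsilon-greedy with propensity logging can be sketched for a single recommendation slot as follows. The function names are hypothetical, and a production system would choose a whole slate and log richer context; this sketch only shows how the logged propensity follows from the policy.

```python
import random
from typing import Dict, Tuple


def propensity_of(item: str, scores: Dict[str, float], epsilon: float) -> float:
    """Probability that epsilon-greedy picks `item` for one slot."""
    greedy = max(scores, key=scores.__getitem__)
    explore_share = epsilon / len(scores)  # uniform draw over all items
    exploit_share = (1.0 - epsilon) if item == greedy else 0.0
    return explore_share + exploit_share


def epsilon_greedy(
    scores: Dict[str, float], epsilon: float, rng: random.Random
) -> Tuple[str, float]:
    """Pick one item and return it with the propensity to log alongside it."""
    if rng.random() < epsilon:
        chosen = rng.choice(list(scores))          # explore
    else:
        chosen = max(scores, key=scores.__getitem__)  # exploit
    return chosen, propensity_of(chosen, scores, epsilon)


predicted = {"video_a": 0.30, "video_b": 0.10, "video_c": 0.05}
rng = random.Random(0)  # seeded only to make the example repeatable
item, propensity = epsilon_greedy(predicted, epsilon=0.1, rng=rng)
print(item, round(propensity, 4))
```

Note that the greedy item can also be drawn during exploration, which is why its propensity is `(1 - epsilon) + epsilon / n` rather than `1 - epsilon`.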

📈 Online/Offline Evaluation

Offline metrics (AUC, NDCG, recall@K) guide iteration but do not always move business metrics. Validate with online A/B tests on CTR, watch time, and retention. Counterfactual/off-policy evaluation using logged propensities lets you estimate how a new policy would perform before shipping it. Guard against metric gaming (clickbait) with long-term objectives.
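The offline metrics and the propensity-based estimate mentioned above can be sketched with binary relevance labels. The inverse propensity scoring (IPS) estimator shown here is one common off-policy method, chosen for illustration; it ignores user context for brevity, and all function names and the log format are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Set, Tuple


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """NDCG with binary gains: 1 / log2(position + 1) per relevant hit."""
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal = sum(1.0 / math.log2(pos + 2) for pos in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0


def ips_estimate(
    logs: List[Tuple[str, float, float]],
    new_policy_prob: Callable[[str], float],
) -> float:
    """Estimate a new policy's average reward from logged traffic.

    Each log entry is (shown_item, observed_reward, logging_propensity).
    """
    if not logs:
        return 0.0
    weighted = (
        reward * new_policy_prob(item) / propensity
        for item, reward, propensity in logs
    )
    return sum(weighted) / len(logs)


ranked = ["a", "b", "c", "d"]
relevant = {"a", "c"}
print(recall_at_k(ranked, relevant, k=2))  # 0.5
print(round(ndcg_at_k(ranked, relevant, k=2), 4))

# Logging policy showed "a" or "b" with probability 0.5 each.
logs = [("a", 1.0, 0.5), ("b", 0.0, 0.5)]
always_a = lambda item: 1.0 if item == "a" else 0.0
print(ips_estimate(logs, always_a))  # 1.0
```

IPS estimates have high variance when logged propensities are small, which is one reason to keep a floor on exploration probability and still confirm results with an A/B test.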

Scaling considerations

What interviewers expect by level

