🎯 Design Recommendation System — System Design Interview Guide

Hard · AI & ML Systems

Design a large-scale recommendation system (feed/products/video) that selects, from millions of items, the handful a user is most likely to engage with — in tens of milliseconds.

Functional requirements

Personalized ranking of items per user request
Two-stage: candidate generation → heavy ranking
Incorporate real-time signals (recent clicks) and long-term profile
Business rules: dedup, diversity, freshness, blocked items
Exploration of new items (avoid feedback loops)

Non-functional requirements & scale

100M users, 10M+ items, 50K recommendation requests/sec
End-to-end ranking latency p99 < 100ms
Models and features refreshed continuously, so recommendations neither go stale nor reinforce feedback loops
Online metrics (CTR, watch time) tied to offline training
Consistent features between training and serving (no skew)

Capacity estimation

You cannot score 10M items per request in 100ms, so recommendation is a funnel: cheap candidate generation narrows millions of items to a few hundred, then an expensive ML ranker scores only those. Embeddings plus approximate nearest neighbor (ANN) search power retrieval; a feature store feeds the ranker. The hard problems are latency, train/serve feature consistency, and avoiding feedback loops.
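The funnel argument can be checked with back-of-envelope arithmetic. The sketch below uses the request rate and catalog size from the requirements and the ~500-candidate funnel output from the deep dive; the embedding dimension (128) and float32 storage are assumptions chosen only for illustration.

```python
# Back-of-envelope capacity numbers. Values marked "assumed" are not from the guide.
REQUESTS_PER_SEC = 50_000       # stated scale
NUM_ITEMS = 10_000_000          # stated scale
CANDIDATES_PER_REQUEST = 500    # funnel output used in the two-stage deep dive
EMBEDDING_DIM = 128             # assumed
BYTES_PER_FLOAT = 4             # assumed float32 storage

# Work the ranker does with the funnel vs. scoring the whole catalog.
ranker_scores_per_sec = REQUESTS_PER_SEC * CANDIDATES_PER_REQUEST
brute_force_scores_per_sec = REQUESTS_PER_SEC * NUM_ITEMS

# Raw size of the item embedding table, before any ANN index overhead.
item_embedding_bytes = NUM_ITEMS * EMBEDDING_DIM * BYTES_PER_FLOAT

print(f"ranker scorings/sec with funnel: {ranker_scores_per_sec:,}")
print(f"scorings/sec without funnel:     {brute_force_scores_per_sec:,}")
print(f"raw item embeddings:             {item_embedding_bytes / 1e9:.2f} GB")
```

Under these assumptions the funnel cuts ranker work by a factor of 20,000, and the raw embedding table is small enough to hold in memory on one machine, though a real ANN index adds overhead on top of it.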

Core entities

User — userId, profileEmbedding, recentEvents[], demographics
Item — itemId, itemEmbedding, metadata, freshness, popularity
Interaction — userId, itemId, type (click/like/watch), timestamp, context
Feature — entityId, featureName, value, version (for the feature store)
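One way to pin these entities down is as plain data classes. This is a minimal sketch: the field names mirror the list above, while the concrete types (for example, timestamps as Unix seconds and an integer feature version) are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Interaction:
    user_id: str
    item_id: str
    type: str                      # "click", "like", or "watch"
    timestamp: float               # assumed: Unix seconds
    context: Dict[str, Any] = field(default_factory=dict)


@dataclass
class User:
    user_id: str
    profile_embedding: List[float]
    recent_events: List[Interaction] = field(default_factory=list)
    demographics: Dict[str, str] = field(default_factory=dict)


@dataclass
class Item:
    item_id: str
    item_embedding: List[float]
    metadata: Dict[str, Any] = field(default_factory=dict)
    freshness: float = 0.0         # assumed: a numeric score
    popularity: float = 0.0        # assumed: a numeric score


@dataclass
class Feature:
    entity_id: str                 # a user or item ID
    feature_name: str
    value: float
    version: int                   # assumed: integer version of the feature definition


click = Interaction(user_id="u1", item_id="i42", type="click", timestamp=1_700_000_000.0)
user = User(user_id="u1", profile_embedding=[0.1, 0.2], recent_events=[click])
```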

API design

GET /api/v1/recommendations — Params: { userId, context, count }. Returns a ranked item list.
POST /api/v1/events — Log an interaction (click/like/watch) for training + real-time features.
POST /api/v1/items — Register/update an item; triggers embedding + indexing.

High-level design

On request, the Rec Service fetches user features from the feature store, runs candidate generation (ANN over item embeddings + popular/recent sources), then scores the ~hundreds of candidates with a ranking model, applies business rules (diversity/dedup/freshness), and returns the top-N. Interaction events stream into a real-time feature pipeline and an offline training pipeline that periodically ships new embeddings and ranker models.
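The request path described above can be sketched as one function with its dependencies injected. Everything here is hypothetical scaffolding: `recommend`, its parameter names, and the stub callables stand in for the feature store, the candidate generators, and the ranking model, and the only business rules applied are dedup and a blocklist.

```python
from typing import Callable, Dict, List, Sequence, Set


def recommend(
    user_id: str,
    count: int,
    fetch_user_features: Callable[[str], Dict[str, float]],
    generate_candidates: Callable[[Dict[str, float]], Sequence[str]],
    score: Callable[[Dict[str, float], str], float],
    blocked: Set[str],
) -> List[str]:
    """Fetch features, retrieve candidates, filter, rank, and return the top `count`."""
    features = fetch_user_features(user_id)
    candidates = generate_candidates(features)

    # Business rules: drop duplicates and blocked items before paying for ranking.
    seen: Set[str] = set()
    eligible: List[str] = []
    for item_id in candidates:
        if item_id in seen or item_id in blocked:
            continue
        seen.add(item_id)
        eligible.append(item_id)

    # Heavy ranking runs only over the surviving candidates.
    ranked = sorted(eligible, key=lambda item_id: score(features, item_id), reverse=True)
    return ranked[:count]


# Usage with stub dependencies (all values are made up).
stub_scores = {"a": 0.2, "b": 0.9, "c": 0.5, "d": 0.7}
top = recommend(
    user_id="u1",
    count=2,
    fetch_user_features=lambda uid: {"affinity": 1.0},
    generate_candidates=lambda feats: ["a", "b", "a", "c", "d"],
    score=lambda feats, item_id: stub_scores[item_id],
    blocked={"d"},
)
```

Filtering before ranking keeps the expensive model off items that could never be shown; diversity and freshness rules would typically run after scoring instead, since they depend on the ranked order.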

Deep dives

🪜 Two-Stage Funnel

Stage 1 (candidate generation / retrieval): cheaply reduce 10M items → ~500 using multiple sources: ANN search over item embeddings with the user embedding as the query (two-tower model), trending, recently-viewed, follow graph. Optimize for recall, not precision, because an item dropped here can never be recovered by the ranker. Stage 2 (ranking): a heavier model scores those ~500 with rich features (user, item, context, cross features) for click/watch probability. This split is what makes sub-100ms latency over millions of items possible.
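Merging several candidate sources into one capped list might look like the sketch below. Round-robin interleaving is an assumption made here so that no single source crowds out the others; the guide does not prescribe a merge strategy, and `merge_candidates` is a hypothetical helper.

```python
from itertools import zip_longest
from typing import Dict, List, Sequence, Set


def merge_candidates(sources: Dict[str, Sequence[str]], limit: int = 500) -> List[str]:
    """Interleave candidate sources round-robin, drop duplicates, cap at `limit`."""
    if limit <= 0:
        return []
    merged: List[str] = []
    seen: Set[str] = set()
    # Each "tier" holds the i-th item from every source (None once a source runs out).
    for tier in zip_longest(*sources.values()):
        for item_id in tier:
            if item_id is None or item_id in seen:
                continue
            seen.add(item_id)
            merged.append(item_id)
            if len(merged) == limit:
                return merged
    return merged


sources = {
    "ann": ["a", "b", "c"],
    "trending": ["b", "d"],
    "follow_graph": ["e"],
}
print(merge_candidates(sources))           # ['a', 'b', 'e', 'd', 'c']
print(merge_candidates(sources, limit=4))  # ['a', 'b', 'e', 'd']
```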

🗄️ Feature Store & Train/Serve Skew

The #1 silent killer of rec quality is computing a feature one way in training (batch) and another at serving (online). A feature store provides the same definitions to both: an offline store for training datasets and a low-latency online store (e.g., Redis) for serving. Point-in-time-correct joins prevent label leakage by attaching to each training example only the feature values that existed when the interaction happened; freshness of real-time features (last-5-min clicks) drives responsiveness.
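A point-in-time lookup can be sketched as a binary search over a feature's value history. The in-memory `FeatureLog` layout and the `value_as_of` helper are hypothetical; a real offline store would do this as a join over tables. The lookup is strictly-before on purpose, so a value written at the same instant as the labeled event (and possibly derived from it) is excluded.

```python
from bisect import bisect_left
from typing import Dict, List, Optional, Tuple

# (entity_id, feature_name) -> [(timestamp, value), ...] sorted by timestamp.
FeatureLog = Dict[Tuple[str, str], List[Tuple[float, float]]]


def value_as_of(
    log: FeatureLog, entity_id: str, feature_name: str, event_ts: float
) -> Optional[float]:
    """Return the latest value written strictly before `event_ts`, or None."""
    history = log.get((entity_id, feature_name), [])
    timestamps = [ts for ts, _ in history]
    # bisect_left counts the entries with timestamp < event_ts.
    idx = bisect_left(timestamps, event_ts)
    return history[idx - 1][1] if idx > 0 else None


log: FeatureLog = {
    ("u1", "clicks_5m"): [(100.0, 1.0), (200.0, 3.0), (300.0, 7.0)],
}
print(value_as_of(log, "u1", "clicks_5m", 250.0))  # 3.0, not the later 7.0
print(value_as_of(log, "u1", "clicks_5m", 50.0))   # None: no value existed yet
```

Joining the latest value (7.0) onto an event at t=250 would leak information from the future into training, which is the kind of skew that inflates offline metrics and then disappears online.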

🔀 Two-Tower Retrieval

Train a user tower and an item tower to embed both into one space so relevance ≈ dot product. Precompute all item embeddings into an ANN index offline; at request time embed the user once and do an ANN lookup. This scales retrieval to millions of items in milliseconds and is the workhorse of modern candidate generation.
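The serving side of two-tower retrieval reduces to a top-k search by dot product. The sketch below does an exact scan over a toy catalog as a stand-in for the ANN index; a production system would swap the scan for an approximate index, and the embeddings here are made-up values rather than trained tower outputs.

```python
import heapq
from typing import Dict, List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def retrieve_top_k(
    user_embedding: Sequence[float],
    item_embeddings: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Exact top-k by dot product. An ANN index approximates this in sublinear time."""
    scored = (
        (item_id, dot(user_embedding, emb)) for item_id, emb in item_embeddings.items()
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Item embeddings are precomputed offline; the user is embedded once per request.
item_embeddings = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
user_embedding = [0.9, 0.1]
top_2 = retrieve_top_k(user_embedding, item_embeddings, k=2)
print([item_id for item_id, _ in top_2])  # ['a', 'c']
```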

🎲 Exploration vs Exploitation

Always serving the top predicted items creates feedback loops: the model never learns about items it never shows. Inject exploration (epsilon-greedy, Thompson sampling, or a bandit) and log propensity scores so you can debias training. Add diversity/dedup rules and freshness boosts so the feed is not repetitive or stale.
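Epsilon-greedy with propensity logging can be sketched for a single recommendation slot as follows. This is the simplest of the options named above, chosen for illustration; `selection_propensity` and `epsilon_greedy_pick` are hypothetical helpers, and extending the propensity calculation to a full ranked slate is more involved.

```python
import random
from typing import Dict, Tuple


def selection_propensity(item_id: str, best: str, n: int, epsilon: float) -> float:
    """Probability that epsilon-greedy shows `item_id` among `n` candidates."""
    explore_share = epsilon / n  # uniform exploration can land on any item
    exploit_share = (1.0 - epsilon) if item_id == best else 0.0
    return explore_share + exploit_share


def epsilon_greedy_pick(
    scores: Dict[str, float], epsilon: float, rng: random.Random
) -> Tuple[str, float]:
    """Pick one item and return it with the propensity to log alongside the impression."""
    if not scores:
        raise ValueError("no candidates to choose from")
    best = max(scores, key=scores.get)
    if rng.random() < epsilon:
        chosen = rng.choice(list(scores))  # explore: uniform over all candidates
    else:
        chosen = best                      # exploit: top predicted item
    return chosen, selection_propensity(chosen, best, len(scores), epsilon)


scores = {"a": 0.9, "b": 0.4, "c": 0.3, "d": 0.1}
item, propensity = epsilon_greedy_pick(scores, epsilon=0.2, rng=random.Random(0))
```

With epsilon 0.2 and four candidates, the top item is shown with probability 0.85 and each other item with probability 0.05. Logging that number with every impression is what makes later debiasing possible.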

📈 Online/Offline Evaluation

Offline metrics (AUC, NDCG, recall@K) guide iteration but do not always move business metrics. Validate with online A/B tests on CTR, watch time, and retention. Counterfactual/off-policy evaluation using logged propensities lets you estimate a new policy before shipping. Guard against metric gaming (clickbait) with long-term objectives.
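Off-policy evaluation with logged propensities can be illustrated with the inverse propensity scoring (IPS) estimator, one common choice among several. The log layout and the `ips_estimate` helper below are assumptions for the sketch, and the toy log is fabricated to show the arithmetic only.

```python
from typing import Callable, Iterable, Tuple

# (context, shown_item, reward, logging_propensity)
LoggedEvent = Tuple[str, str, float, float]


def ips_estimate(
    logs: Iterable[LoggedEvent],
    new_policy_prob: Callable[[str, str], float],
) -> float:
    """Estimate the new policy's average reward from data logged by the old policy."""
    total = 0.0
    n = 0
    for context, item_id, reward, propensity in logs:
        # Reweight each logged reward by how much more (or less) likely
        # the new policy is to show this item than the logging policy was.
        total += reward * new_policy_prob(context, item_id) / propensity
        n += 1
    if n == 0:
        raise ValueError("empty log")
    return total / n


# Toy log: the old policy showed "a" or "b" with probability 0.5 each.
logs = [
    ("u1", "a", 1.0, 0.5),
    ("u1", "b", 0.0, 0.5),
    ("u2", "a", 0.0, 0.5),
    ("u2", "b", 1.0, 0.5),
]


def always_a(context: str, item_id: str) -> float:
    """Candidate policy that always shows item "a"."""
    return 1.0 if item_id == "a" else 0.0


print(ips_estimate(logs, always_a))  # 0.5
```

IPS is unbiased only when the logging policy gave every item the new policy might show a nonzero propensity, which is one more reason to keep exploration on. Small propensities also make the estimate high-variance, so clipped or self-normalized variants are often used in practice.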

Scaling considerations

Precompute item embeddings + ANN index offline; refresh as items/models change
Online feature store (e.g., Redis) with a tight p99; degrade gracefully to cached or popular items on a miss or timeout
Rank only a few hundred candidates per request to hold the latency budget
Kafka events fan out to both real-time feature updates and offline training
Shadow-test and A/B-test new models; roll out via traffic splitting with guardrail metrics
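The graceful-degradation point in the list above can be sketched as a wrapper around the personalized path. The function and parameter names are hypothetical, and treating a missing row and a `TimeoutError` the same way is an assumption about how the online store client signals failure.

```python
from typing import Callable, Dict, List, Optional, Sequence


def recommend_with_fallback(
    user_id: str,
    count: int,
    lookup_features: Callable[[str], Optional[Dict[str, float]]],
    personalized: Callable[[Dict[str, float], int], List[str]],
    popular_items: Sequence[str],
) -> List[str]:
    """Serve personalized results, or a precomputed popular list if features are unavailable."""
    try:
        features = lookup_features(user_id)
    except TimeoutError:
        features = None  # treat a slow store like a miss to protect the latency budget
    if features is None:
        return list(popular_items[:count])
    return personalized(features, count)


popular = ["p1", "p2", "p3"]


def timing_out_lookup(user_id: str) -> Optional[Dict[str, float]]:
    raise TimeoutError("online store exceeded its deadline")


fallback = recommend_with_fallback(
    "u1", 2, timing_out_lookup, lambda feats, n: ["x"] * n, popular
)
print(fallback)  # ['p1', 'p2']
```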

What interviewers expect by level

Junior: Explain candidate generation vs ranking and why we can't score every item.
Mid: Design the two-stage funnel, embeddings + ANN retrieval, a feature store, event logging.
Senior: Two-tower retrieval, train/serve skew prevention, real-time features, exploration, A/B evaluation.
Staff: Off-policy evaluation, feedback-loop mitigation, multi-objective ranking, end-to-end ML platform + model governance.
