🧬 Design Semantic Search — System Design Interview Guide

Medium · AI & ML Systems

Design a semantic search engine that retrieves documents by meaning, not just keywords — e.g. "how do I cancel my plan" matches "subscription termination steps".

Open the interactive Semantic Search design on PrepGrind → Drag load balancers, caches, databases, and queues onto a canvas, run a live traffic simulation to watch latency and bottlenecks under load, and follow the full interview walkthrough below — free, in your browser.

Functional requirements

Non-functional requirements & scale

Capacity estimation

100M chunks × 1,536 dims × 4 bytes ≈ 600GB of raw vectors — must be sharded and largely in RAM for an HNSW index. Embedding the corpus is a one-time (re-run on model change) batch job; queries embed one short string at read time. The hard parts are recall, freshness, and filtered ANN.

Core entities

API design

High-level design

Ingest path: a worker chunks each document, calls the embedding model, and upserts vectors + metadata into the Vector DB (with the raw text in object storage / a doc store). Query path: the Search Service embeds the query, runs a filtered ANN search in the Vector DB, runs BM25 in a keyword index, fuses the two rankings, and returns the top-K. A cache short-circuits repeated queries.

Deep dives

✂️ Chunking Strategy

Embeddings degrade on very long text, so split documents into ~200–500 token chunks with ~10–20% overlap so meaning that straddles a boundary is not lost. Prefer semantic boundaries (headings, paragraphs) over fixed sizes. Store chunk→doc linkage so a hit can be expanded to its source. Chunk size is a recall/precision dial — smaller = sharper matches but more vectors.

🔎 ANN with HNSW

Brute-force cosine over 100M vectors is too slow. HNSW builds a layered proximity graph giving O(log N) search with high recall. Tune efSearch (higher = better recall, slower) and M (graph degree). The graph lives in RAM, so shard across nodes and route a query to all shards, then merge top-K.

⚖️ Hybrid Ranking (RRF)

Pure vectors miss exact tokens (product codes, names); pure BM25 misses paraphrases. Run both and fuse with Reciprocal Rank Fusion: score = Σ 1/(k + rank_i). This needs no score normalization and reliably beats either signal alone. Optionally add a cross-encoder re-ranker on the top ~50 for precision.

🔐 Filtered Search & Multi-Tenancy

Most queries are scoped (tenant, ACL, language, recency). Apply filters as metadata predicates DURING the ANN traversal (pre-filtering) rather than filtering after, or recall collapses. Partition or namespace the index per tenant for hard isolation and smaller, faster graphs.

Scaling considerations

What interviewers expect by level

Practice more system design case studies

PrepGrind runs entirely in your browser, free, no installation required. Loading the interactive playground…