🧬 Design Semantic Search — System Design Interview Guide

Medium · AI & ML Systems

Design a semantic search engine that retrieves documents by meaning, not just keywords — e.g. "how do I cancel my plan" matches "subscription termination steps".

Functional requirements

Ingest documents: chunk, embed, and index them
Query by natural language; return top-K most relevant chunks
Hybrid ranking: combine semantic (vector) + keyword (BM25) scores
Metadata filters (tenant, language, recency, permissions)
Near-real-time index updates as documents change

Non-functional requirements & scale

100M document chunks, 1,536-dim embeddings
Query latency p95 < 150ms end-to-end
Recall@10 > 90% vs an offline ground-truth set
Index updates visible within minutes of a document change
Multi-tenant isolation — a tenant never sees another's data
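
The Recall@10 target above can be checked offline roughly as follows. This is a minimal sketch that assumes the ground truth for each query is the exact (brute-force) top-K list of chunk IDs; the function names are illustrative and not part of the original design.

```python
def recall_at_k(retrieved, ground_truth, k=10):
    """Fraction of the true top-k chunk IDs that the ANN search also returned."""
    truth = set(ground_truth[:k])
    if not truth:
        return 0.0
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in truth)
    return hits / len(truth)


def mean_recall_at_k(runs, k=10):
    """Average recall over (retrieved, ground_truth) pairs, one pair per query."""
    if not runs:
        return 0.0
    return sum(recall_at_k(r, g, k) for r, g in runs) / len(runs)


# Toy evaluation set: the first query finds 2 of 3 true neighbours, the second all 3.
runs = [
    (["a", "b", "x"], ["a", "b", "c"]),
    (["d", "e", "f"], ["f", "e", "d"]),
]
average = mean_recall_at_k(runs, k=3)
```

Running the same harness before and after an index or parameter change gives a direct view of whether the 90% target still holds.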

Capacity estimation

100M chunks × 1,536 dims × 4 bytes ≈ 614GB of raw float32 vectors, before HNSW graph links and metadata are counted. An HNSW index wants that data largely in RAM, so it must be sharded across nodes. Embedding the corpus is a batch job that runs once and is re-run only when the embedding model changes; at read time each query embeds one short string. The hard parts are recall, freshness, and filtered ANN.
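
The arithmetic above can be worked through in a few lines. The per-node RAM figure is a hypothetical value chosen only to illustrate the shard-count calculation, and the estimate covers raw vectors only (HNSW graph links and metadata add more on top).

```python
import math

NUM_CHUNKS = 100_000_000
DIMS = 1_536
BYTES_PER_FLOAT32 = 4

# Raw vector storage in decimal gigabytes.
raw_gb = NUM_CHUNKS * DIMS * BYTES_PER_FLOAT32 / 1e9

# Assumption for illustration: each node can spare 64 GB of RAM for vectors.
USABLE_RAM_PER_NODE_GB = 64
min_shards = math.ceil(raw_gb / USABLE_RAM_PER_NODE_GB)

# The same corpus stored as int8 codes (1 byte per dimension).
int8_gb = NUM_CHUNKS * DIMS * 1 / 1e9

print(f"float32: {raw_gb:.1f} GB, at least {min_shards} shards")
print(f"int8:    {int8_gb:.1f} GB")
```

Replication for availability multiplies the node count further, which is one reason quantization is attractive at this scale.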

Core entities

Document — docId, tenantId, source, title, updatedAt, aclGroups[]
Chunk — chunkId, docId, text, embedding (vector), position, tokenCount
Query — queryText, embedding, filters, topK, userId

API design

POST /api/v1/index — Ingest a document. Body: { docId, text, metadata }. Async chunk+embed+upsert.
POST /api/v1/search — Body: { query, filters, topK }. Returns ranked chunks with scores.
DELETE /api/v1/index/:docId — Remove a document and its chunks from the index.
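
As an illustration of the search endpoint's request body, the sketch below builds and validates the { query, filters, topK } payload. The helper name, the validation rules, and the example filter values are assumptions made for this example, not part of the API definition above.

```python
import json


def build_search_body(query, filters=None, top_k=10):
    """Serialize a POST /api/v1/search body, rejecting obviously invalid input."""
    if not query or not query.strip():
        raise ValueError("query must be a non-empty string")
    if top_k < 1:
        raise ValueError("topK must be at least 1")
    return json.dumps(
        {"query": query.strip(), "filters": filters or {}, "topK": top_k}
    )


# Hypothetical tenant and language values, used only for illustration.
body = build_search_body(
    "how do I cancel my plan",
    filters={"tenant": "acme", "language": "en"},
    top_k=5,
)
```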

High-level design

Ingest path: a worker chunks each document, calls the embedding model, and upserts vectors + metadata into the Vector DB (with the raw text in object storage / a doc store). Query path: the Search Service embeds the query, runs a filtered ANN search in the Vector DB, runs BM25 in a keyword index, fuses the two rankings, and returns the top-K. A cache short-circuits repeated queries.
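
One way the query cache mentioned above could be keyed is sketched here. It assumes the key should cover everything that changes the result set: tenant, normalized query text, filters, top-K, and an index version. The function and parameter names are hypothetical.

```python
import hashlib
import json


def cache_key(tenant_id, query, filters, top_k, index_version):
    """Build a deterministic key so equivalent queries share one cache entry."""
    # Collapse case and whitespace so trivially different strings hit the same entry.
    normalized = " ".join(query.lower().split())
    payload = json.dumps(
        {"t": tenant_id, "q": normalized, "f": filters, "k": top_k, "v": index_version},
        sort_keys=True,  # stable ordering of filter keys
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


key_a = cache_key("acme", "How do I  cancel my plan", {"language": "en"}, 10, "v3")
key_b = cache_key("acme", "how do i cancel my plan", {"language": "en"}, 10, "v3")
key_other_tenant = cache_key("beta", "how do i cancel my plan", {"language": "en"}, 10, "v3")
```

Including the tenant in the key keeps one tenant's cached results from being served to another, and including the index version means a rebuilt index naturally misses the old entries.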

Deep dives

✂️ Chunking Strategy

Embeddings degrade on very long text, so split documents into ~200–500 token chunks with ~10–20% overlap so meaning that straddles a boundary is not lost. Prefer semantic boundaries (headings, paragraphs) over fixed sizes. Store chunk→doc linkage so a hit can be expanded to its source. Chunk size is a recall/precision dial — smaller = sharper matches but more vectors.
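
A minimal sketch of fixed-window chunking with overlap follows. It assumes the document is already tokenized into a list; a real pipeline would use the embedding model's tokenizer and prefer heading or paragraph boundaries, as described above. The defaults (300 tokens, 45 overlap, i.e. 15%) are illustrative picks from the stated ranges.

```python
def chunk_tokens(tokens, chunk_size=300, overlap=45):
    """Split a token list into overlapping windows, recording each chunk's position."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append({"position": len(chunks), "start": start, "tokens": window})
        # Stop once a window has reached the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(tokens):
            break
    return chunks


tokens = [f"tok{i}" for i in range(700)]
chunks = chunk_tokens(tokens)
```

Each chunk keeps its position and start offset, so a hit can be mapped back to its place in the source document.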

🔎 ANN with HNSW

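Brute-force cosine similarity over 100M vectors is too slow for the latency budget. HNSW builds a layered proximity graph that gives approximately O(log N) search with high recall. Tune efSearch (the candidate list size at query time: higher = better recall, slower) and M (the maximum number of graph links per node: higher = better recall, more memory). The graph lives in RAM, so shard it across nodes, route each query to all shards, then merge the per-shard top-K.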
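
The merge step of the scatter-gather query can be sketched as below. It assumes each shard has already run its own ANN search and returned (score, chunk_id) pairs where a higher score means more similar; the shard contents are made-up values.

```python
import heapq


def merge_top_k(shard_results, k):
    """Merge per-shard hit lists into one global top-k, best score first."""
    all_hits = (hit for hits in shard_results for hit in hits)
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[0])


# Hypothetical results from two shards, each returning its local top 2.
shard_a = [(0.92, "a1"), (0.80, "a2")]
shard_b = [(0.95, "b1"), (0.70, "b2")]
merged = merge_top_k([shard_a, shard_b], k=3)
```

Each shard needs to return its own full top-K, because in the worst case all of the global top-K live on a single shard.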

⚖️ Hybrid Ranking (RRF)

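Pure vectors miss exact tokens (product codes, names); pure BM25 misses paraphrases. Run both and fuse with Reciprocal Rank Fusion: score(d) = Σ_i 1/(k + rank_i(d)), where rank_i(d) is the 1-based position of chunk d in ranking i and k is a smoothing constant (60 is a common default). RRF uses only ranks, so it needs no score normalization, and it typically beats either signal alone. Optionally add a cross-encoder re-ranker on the top ~50 for precision.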
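
The fusion formula can be sketched in a few lines. This assumes each retriever returns an ordered list of chunk IDs, best first, and uses k = 60 as a commonly cited default; the chunk IDs are placeholders.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one list of (chunk_id, score)."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based; a chunk absent from a ranking contributes nothing.
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


vector_hits = ["c1", "c2", "c3"]  # semantic (ANN) ranking
bm25_hits = ["c3", "c1", "c4"]    # keyword ranking
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
fused_ids = [chunk_id for chunk_id, _ in fused]
```

Chunks that appear in both rankings (c1 and c3 here) rise above chunks that only one retriever found, which is the behaviour hybrid ranking is after.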

🔐 Filtered Search & Multi-Tenancy

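Most queries are scoped (tenant, ACL, language, recency). Apply filters as metadata predicates during the ANN traversal (filter-aware search, referred to here as pre-filtering) rather than filtering the results afterwards. With post-filtering, a selective filter can discard most of the top-K candidates and leave too few results, so recall collapses. Partition or namespace the index per tenant for hard isolation and smaller, faster graphs.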
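
The failure mode of filtering after the search can be shown on a toy dataset. This sketch uses an exact dot-product search as a stand-in for ANN, and the tenants and vectors are invented for illustration; a real vector database applies the predicate inside its own index traversal.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_k(query, items, k, allowed=None):
    """Exact top-k by dot product, optionally restricted to items passing `allowed`."""
    candidates = [item for item in items if allowed is None or allowed(item)]
    ranked = sorted(candidates, key=lambda item: dot(query, item["vec"]), reverse=True)
    return [item["id"] for item in ranked[:k]]


def is_beta(item):
    return item["tenant"] == "beta"


items = [
    {"id": "a1", "tenant": "acme", "vec": [0.9, 0.1]},
    {"id": "a2", "tenant": "acme", "vec": [0.8, 0.2]},
    {"id": "a3", "tenant": "acme", "vec": [0.7, 0.3]},
    {"id": "b1", "tenant": "beta", "vec": [0.6, 0.4]},
    {"id": "b2", "tenant": "beta", "vec": [0.1, 0.9]},
]
by_id = {item["id"]: item for item in items}
query = [1.0, 0.0]

# Filter after: the global top 2 both belong to another tenant, so nothing survives.
post_filtered = [cid for cid in top_k(query, items, 2) if is_beta(by_id[cid])]

# Filter during the search: the predicate restricts candidates, so 2 results come back.
filtered_first = top_k(query, items, 2, allowed=is_beta)
```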

Scaling considerations

Shard the vector index by hash; scatter-gather queries and merge top-K
Keep HNSW graphs in RAM; use quantization (PQ/int8) to fit more vectors per node
Re-embedding on a model upgrade is a full corpus rebuild — version your indexes
Cache popular queries and their results; invalidate on relevant doc updates
Batch embedding calls on the ingest path for GPU efficiency
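
To make the quantization point concrete, here is a minimal sketch of symmetric per-vector int8 scalar quantization. It is one simple scheme among several (product quantization works differently), and the function names are illustrative.

```python
def quantize_int8(vec):
    """Map floats to integer codes in [-127, 127] plus one float scale per vector."""
    max_abs = max(abs(x) for x in vec) or 1.0  # avoid dividing by zero for all-zero vectors
    scale = max_abs / 127.0
    codes = [round(x / scale) for x in vec]
    return codes, scale


def dequantize(codes, scale):
    """Approximate reconstruction of the original floats."""
    return [c * scale for c in codes]


vec = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize_int8(vec)
approx = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(vec, approx))
```

Each dimension drops from 4 bytes to 1, at the cost of a reconstruction error of at most about half the scale per dimension, which is the recall versus memory trade-off a design has to evaluate.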

What interviewers expect by level

Junior: Explain embeddings, cosine similarity, and the embed→index→search flow.
Mid: Design chunking, an ANN (HNSW) index, query caching, and basic metadata filters.
Senior: Add hybrid BM25+vector fusion, sharding/scatter-gather, pre-filtered ANN, multi-tenant isolation.
Staff: Index versioning + zero-downtime re-embedding, recall evaluation harness, quantization cost trade-offs, cross-encoder re-ranking.
