🧬 Design Semantic Search — System Design Interview Guide
Medium · AI & ML Systems
Design a semantic search engine that retrieves documents by meaning, not just keywords — e.g. "how do I cancel my plan" matches "subscription termination steps".
Functional requirements
- Ingest documents: chunk, embed, and index them
- Query by natural language; return top-K most relevant chunks
- Hybrid ranking: combine semantic (vector) + keyword (BM25) scores
- Metadata filters (tenant, language, recency, permissions)
- Near-real-time index updates as documents change
Non-functional requirements & scale
- 100M document chunks, 1,536-dim embeddings
- Query latency p95 < 150ms end-to-end
- Recall@10 > 90% vs an offline ground-truth set
- Index updates visible within minutes of a document change
- Multi-tenant isolation — a tenant never sees another's data
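The Recall@10 target above can be checked with a small offline harness. This is a minimal sketch, assuming recall is defined as the fraction of ground-truth chunk IDs that appear in the top-K results, averaged over queries; the function names and sample IDs are hypothetical.

```python
from typing import Sequence


def recall_at_k(retrieved: Sequence[str], relevant: Sequence[str], k: int = 10) -> float:
    """Fraction of the ground-truth IDs found in the first k retrieved IDs."""
    if not relevant:
        raise ValueError("ground-truth set must not be empty")
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


def mean_recall_at_k(runs: Sequence[tuple[Sequence[str], Sequence[str]]], k: int = 10) -> float:
    """Average recall@k over (retrieved, relevant) pairs, one pair per query."""
    return sum(recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)


# Two hypothetical queries: the first finds 2 of 3 relevant chunks, the second finds both.
example_runs = [
    (["a", "b", "c"], ["a", "c", "d"]),
    (["x", "y"], ["x", "y"]),
]
example_mean = mean_recall_at_k(example_runs, k=3)
```

Running the same harness before and after an index or parameter change is one way to confirm the 90% target still holds.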
Capacity estimation
100M chunks × 1,536 dims × 4 bytes ≈ 614GB of raw float32 vectors, before any HNSW graph overhead, so the index must be sharded and kept largely in RAM. Embedding the corpus is a one-time batch job (re-run on a model change); queries embed one short string at read time. The hard parts are recall, freshness, and filtered ANN.
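The arithmetic behind this estimate can be written out as a quick calculation. The 64 GB of usable RAM per node is an assumed figure for illustration only, and the result ignores HNSW graph links and metadata, which add to the footprint.

```python
NUM_CHUNKS = 100_000_000
DIMS = 1_536
BYTES_PER_FLOAT32 = 4
ASSUMED_USABLE_RAM_PER_NODE = 64 * 10**9  # hypothetical: 64 GB usable per node


def raw_vector_bytes(num_vectors: int, dims: int, bytes_per_value: int) -> int:
    """Size of the raw vectors only, with no index overhead."""
    return num_vectors * dims * bytes_per_value


def nodes_needed(total_bytes: int, usable_bytes_per_node: int) -> int:
    """Ceiling division: smallest node count whose combined RAM holds total_bytes."""
    return -(-total_bytes // usable_bytes_per_node)


total_bytes = raw_vector_bytes(NUM_CHUNKS, DIMS, BYTES_PER_FLOAT32)
total_gb = total_bytes / 1e9  # 614.4 GB, usually rounded to "about 600 GB"
min_nodes = nodes_needed(total_bytes, ASSUMED_USABLE_RAM_PER_NODE)  # 10 under the assumption above
```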
Core entities
- Document — docId, tenantId, source, title, updatedAt, aclGroups[]
- Chunk — chunkId, docId, text, embedding (vector), position, tokenCount
- Query — queryText, embedding, filters, topK, userId
API design
- POST /api/v1/index: Ingest a document. Body: { docId, text, metadata }. Async chunk+embed+upsert.
- POST /api/v1/search: Body: { query, filters, topK }. Returns ranked chunks with scores.
- DELETE /api/v1/index/:docId: Remove a document and its chunks from the index.
High-level design
Ingest path: a worker chunks each document, calls the embedding model, and upserts vectors + metadata into the Vector DB (with the raw text in object storage / a doc store). Query path: the Search Service embeds the query, runs a filtered ANN search in the Vector DB, runs BM25 in a keyword index, fuses the two rankings, and returns the top-K. A cache short-circuits repeated queries.
Deep dives
✂️ Chunking Strategy
Embeddings degrade on very long text, so split documents into ~200–500 token chunks with ~10–20% overlap so meaning that straddles a boundary is not lost. Prefer semantic boundaries (headings, paragraphs) over fixed sizes. Store chunk→doc linkage so a hit can be expanded to its source. Chunk size is a recall/precision dial — smaller = sharper matches but more vectors.
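A minimal sketch of fixed-size chunking with overlap, as a baseline before adding semantic boundaries. It assumes the text is already tokenized (whitespace splitting stands in for a real tokenizer here), and the default sizes are illustrative values inside the ranges mentioned above.

```python
def chunk_tokens(tokens: list, chunk_size: int = 300, overlap: int = 45) -> list[dict]:
    """Split tokens into windows of chunk_size that share `overlap` tokens with the previous window."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        # position and start let a hit be mapped back to its place in the source document
        chunks.append({"position": len(chunks), "start": start, "tokens": window})
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Tiny example: 10 tokens, windows of 4 with 1 token of overlap.
example_chunks = chunk_tokens(list(range(10)), chunk_size=4, overlap=1)
```

In the example, consecutive windows start at tokens 0, 3, and 6, so each boundary token appears in two chunks.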
🔎 ANN with HNSW
Brute-force cosine over 100M vectors is too slow. HNSW builds a layered proximity graph that gives roughly logarithmic search time in practice with high recall. Tune efSearch (higher = better recall, slower) and M (graph degree; higher = better recall, more memory). The graph lives in RAM, so shard across nodes and route a query to all shards, then merge top-K.
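The final merge step of scatter-gather can be sketched as below. It assumes each shard has already returned its own local top-K as (chunk_id, score) pairs with higher scores meaning closer matches; the shard contents are made-up values.

```python
import heapq
from typing import Iterable


def merge_top_k(shard_results: Iterable[list[tuple[str, float]]], k: int) -> list[tuple[str, float]]:
    """Merge per-shard hit lists into one global top-k, best score first."""
    all_hits = (hit for hits in shard_results for hit in hits)
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[1])


# Hypothetical local results from two shards.
shard_a = [("a1", 0.91), ("a2", 0.80)]
shard_b = [("b1", 0.95), ("b2", 0.60)]
merged = merge_top_k([shard_a, shard_b], k=3)
```

Each shard must return at least K candidates for the merged result to be a correct global top-K, since all K winners could live on one shard.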
⚖️ Hybrid Ranking (RRF)
Pure vectors miss exact tokens (product codes, names); pure BM25 misses paraphrases. Run both and fuse with Reciprocal Rank Fusion: score(d) = Σ_i 1/(k + rank_i(d)), where rank_i(d) is the 1-based rank of d in result list i and k is a smoothing constant (60 is a common default). Because it uses ranks only, it needs no score normalization, and it often beats either signal alone. Optionally add a cross-encoder re-ranker on the top ~50 for precision.
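A minimal sketch of Reciprocal Rank Fusion over two ranked lists of chunk IDs. The value k=60 is a commonly used default rather than something this design prescribes, and the tie-break on chunk ID is an assumption added to keep the output deterministic.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60, top_k: int = 10) -> list[tuple[str, float]]:
    """Fuse ranked ID lists: each list contributes 1 / (k + rank) per ID, with 1-based ranks."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_k]


# Hypothetical results from the vector index and the BM25 index.
vector_hits = ["c1", "c2", "c3"]
bm25_hits = ["c3", "c1", "c4"]
fused = rrf_fuse([vector_hits, bm25_hits])
fused_ids = [chunk_id for chunk_id, _ in fused]
```

Chunks that appear in both lists (c1 and c3 here) rise above chunks found by only one retriever, which is the behaviour the fusion step is meant to provide.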
🔐 Filtered Search & Multi-Tenancy
Most queries are scoped (tenant, ACL, language, recency). Apply filters as metadata predicates during the ANN traversal (often called pre-filtering or filtered ANN) rather than filtering the top-K afterwards: with a selective filter, post-filtering can leave far fewer than K results, so recall collapses. Partition or namespace the index per tenant for hard isolation and smaller, faster graphs.
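The difference between filtering after the search and filtering as part of it can be shown with a toy example. This sketch uses brute-force dot products as a stand-in for an ANN index (a real HNSW implementation would check the predicate while walking the graph), and the tenants, IDs, and vectors are invented.

```python
def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def search_post_filter(query: list[float], chunks: list[dict], tenant_id: str, k: int) -> list[str]:
    """Take the global top-k first, then drop hits from other tenants."""
    ranked = sorted(chunks, key=lambda c: dot(query, c["embedding"]), reverse=True)[:k]
    return [c["chunk_id"] for c in ranked if c["tenant_id"] == tenant_id]


def search_filtered(query: list[float], chunks: list[dict], tenant_id: str, k: int) -> list[str]:
    """Only consider chunks that pass the predicate, then take the top-k of those."""
    allowed = [c for c in chunks if c["tenant_id"] == tenant_id]
    ranked = sorted(allowed, key=lambda c: dot(query, c["embedding"]), reverse=True)[:k]
    return [c["chunk_id"] for c in ranked]


chunks = [
    {"chunk_id": "o1", "tenant_id": "other", "embedding": [1.0, 0.0]},
    {"chunk_id": "o2", "tenant_id": "other", "embedding": [0.9, 0.1]},
    {"chunk_id": "o3", "tenant_id": "other", "embedding": [0.8, 0.2]},
    {"chunk_id": "a1", "tenant_id": "acme", "embedding": [0.5, 0.5]},
    {"chunk_id": "a2", "tenant_id": "acme", "embedding": [0.4, 0.6]},
]
query = [1.0, 0.0]
post_filtered = search_post_filter(query, chunks, "acme", k=2)  # other tenants crowd out the top-2
filtered = search_filtered(query, chunks, "acme", k=2)
```

Here the post-filtered search returns nothing for tenant "acme" because both global top hits belong to another tenant, while the filtered search still returns that tenant's two best chunks.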
Scaling considerations
- Shard the vector index by hash; scatter-gather queries and merge top-K
- Keep HNSW graphs in RAM; use quantization (PQ/int8) to fit more vectors per node
- Re-embedding on a model upgrade is a full corpus rebuild — version your indexes
- Cache popular queries and their results; invalidate on relevant doc updates
- Batch embedding calls on the ingest path for GPU efficiency
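To illustrate the quantization point in the list above, here is a minimal sketch of symmetric int8 scalar quantization with one scale per vector. It is one simple scheme among several (product quantization works differently), and the sample vector is arbitrary.

```python
def quantize_int8(vector: list[float]) -> tuple[list[int], float]:
    """Map floats to integer codes in [-127, 127] using a single per-vector scale."""
    max_abs = max((abs(x) for x in vector), default=0.0) or 1.0  # avoid a zero scale
    scale = max_abs / 127.0
    codes = [round(x / scale) for x in vector]
    return codes, scale


def dequantize_int8(codes: list[int], scale: float) -> list[float]:
    """Approximate reconstruction of the original floats."""
    return [code * scale for code in codes]


example_vector = [0.5, -1.0, 0.25]
codes, scale = quantize_int8(example_vector)
restored = dequantize_int8(codes, scale)
max_error = max(abs(a - b) for a, b in zip(example_vector, restored))

# One byte per dimension instead of four for the corpus sized earlier in this guide.
int8_corpus_bytes = 100_000_000 * 1_536 * 1  # 153.6 GB of raw codes
```

The per-value reconstruction error is bounded by half the scale, which is the accuracy given up in exchange for a 4x smaller vector footprint; whether recall stays within target is something to measure rather than assume.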
What interviewers expect by level
- Junior: Explain embeddings, cosine similarity, and the embed→index→search flow.
- Mid: Design chunking, an ANN (HNSW) index, query caching, and basic metadata filters.
- Senior: Add hybrid BM25+vector fusion, sharding/scatter-gather, pre-filtered ANN, multi-tenant isolation.
- Staff: Index versioning + zero-downtime re-embedding, recall evaluation harness, quantization cost trade-offs, cross-encoder re-ranking.