🧠 Vector Search (HNSW) — AI / ML Interview Guide
LLM Applications · interactive visualization + interview prep
Open the interactive Vector Search (HNSW) visualization on PrepGrind → Step through a live animation, tune the parameters, and read the full theory, math, reference code, and interview Q&A below — free, in your browser.
What it is
Finding the nearest vector by comparing the query to all N vectors is too slow at millions of items. HNSW builds a layered "small-world" graph: sparse long hops at the top get you near the right region fast, then denser layers refine the search — so you only touch a handful of nodes instead of all of them.
Mental model
Searching a million vectors one-by-one is the "drive every road" problem. HNSW builds a layered map: the TOP layer is a sparse highway network of a few far-apart hubs; each lower layer adds more local streets. A search starts on the highway to jump near the target REGION in a few big hops, then descends layer by layer to local streets to pin the exact nearest neighbor — touching a handful of nodes instead of all of them.
Theory
Exact nearest-neighbor search compares the query to all N vectors — O(N·d) per query, which is hopeless at millions/billions of items and many queries per second. So vector search accepts APPROXIMATE nearest neighbors (ANN): trade a little recall (occasionally missing the true closest) for orders-of-magnitude speed. HNSW is the dominant ANN graph index.
The core engine is greedy search on a "small-world" proximity graph: from a starting node, repeatedly move to whichever neighbor is closest to the query, stopping when no neighbor is closer. On a well-connected graph this reaches a near-neighbor quickly, but on a single flat graph it needs many short steps to cross the space.
The hierarchy fixes that. Each node is assigned a random maximum level from a geometric distribution, so few nodes reach high levels. Upper layers are sparse graphs of only the high-level nodes, giving long-range "highway" hops; lower layers are progressively denser. Search enters at the top, greedily hops to get near the region, then DESCENDS a layer and refines — reaching the neighborhood in roughly O(log N) hops.
Two parameters tune the recall / latency / memory triangle. M is the number of neighbors per node (graph degree): higher M raises recall and memory. efSearch is the size of the candidate list kept during search: higher efSearch raises recall but slows queries (efConstruction does the same at build time). There is no free lunch — you pick the operating point your application needs.
Versus alternatives: IVF (inverted file) clusters vectors and only scans a few cells — lower memory, easy to shard, slightly lower recall — and is often combined with Product Quantization (PQ) to compress vectors. HNSW gives the best recall/latency but is memory-heavy (M links per vector) and costlier to update incrementally. Knowing this trade-off is the typical interview payoff.
Concrete example
A vector DB (Pinecone, pgvector, FAISS-HNSW) storing 10M document embeddings. A RAG query embeds the question and asks for the top-k nearest chunks; HNSW returns them in milliseconds by hopping through the graph instead of scanning all 10M.
Key equations
each node gets a random max level ℓ ~ ⌊−ln(U)·mL⌋ (few nodes reach high levels)layer ℓ = a proximity graph over nodes with level ≥ ℓsearch: enter at the top layer’s entry pointgreedily hop to the neighbor closest to the query; when none is closer, descendrepeat down to layer 0 → approximate nearest neighbor (ANN)
Step by step
- Nodes are vectors; bigger/brighter = higher max layer.
- Search starts at the entry point on the top layer.
- At each layer, greedily move to the neighbor nearest the query (cyan hops).
- Descend a layer and repeat — the search zooms in.
- Compare distance comparisons made vs brute-force N — that gap is the speedup.
Interview questions & answers
Why is HNSW approximate, not exact?
Greedy graph traversal can settle in a local optimum and miss the true nearest neighbor. You trade a little recall for a huge speedup; parameters (efSearch, M) tune that trade-off.
What do M and efSearch control?
M = neighbors per node (graph degree): higher M ⇒ better recall, more memory. efSearch = size of the candidate list during search: higher ⇒ better recall, slower. They tune the accuracy/latency/memory trade-off.
HNSW vs IVF (inverted file) indexes?
HNSW is a graph index: great recall/latency, higher memory, costly updates. IVF clusters vectors and scans a few cells: lower memory, easy to shard, slightly lower recall. Many systems combine IVF + PQ for compression.
Why the hierarchical layers instead of one flat graph?
The sparse upper layers provide long-range jumps so the search reaches the right neighborhood in O(log N) hops, instead of walking many short edges on a single dense graph.
Common pitfalls
- Assuming ANN results are exact — validate recall for your use case.
- Ignoring memory: HNSW graphs are large (M links per vector).
- Rebuilding from scratch on every update; plan for incremental inserts/deletes.
Where it shows up
- Vector DBs: Pinecone, Weaviate, Qdrant, Milvus, pgvector, FAISS (HNSW)
- Semantic search / RAG retrieval at scale
- Recommendation & dedup over embeddings
More AI / ML interview concepts
- Neural Networks & Backpropagation
- Gradient Descent & Optimizers
- Activation Functions
- K-Means Clustering
- Self-Attention
- Multi-Head Attention
- Softmax, Temperature & Sampling
- Tokenization (Byte-Pair Encoding)
- Positional Encoding
- KV Cache
- Rotary Position Embedding (RoPE)
- The Transformer Block
- Normalization (LayerNorm / RMSNorm)
- Multi-Query & Grouped-Query Attention
- Flash Attention
- Decoding: Beam Search & Speculative Decoding
- Embeddings & Cosine Similarity
- RAG (Retrieval-Augmented Generation) Pipeline
- Chunking & Reranking
- ReAct Agent Loop
- Tool / Function Calling
- Multi-Agent Orchestration
- Planning & Task Decomposition
- Agent Memory
- Model Context Protocol (MCP)
- Quantization
- LoRA / PEFT Fine-Tuning
- Mixture of Experts (MoE)
- RLHF / DPO Alignment
- Evals & LLM-as-Judge
- Prompt Injection & Guardrails
- Knowledge Distillation
PrepGrind runs entirely in your browser, free, no installation required. Loading the interactive playground…