🧠 Chunking & Reranking — AI / ML Interview Guide

LLM Applications · interactive visualization + interview prep

What it is

RAG quality lives or dies on retrieval. Two levers control it. First, HOW you split documents into chunks (sentence, fixed-size, or overlapping) determines what can be retrieved. Second, a RERANKER re-scores the top vector hits by reading the query and chunk together, fixing the ordering that fast but approximate vector search gets wrong.

Mental model

A two-stage funnel. Stage 1 (retrieval) is a fast librarian: a bi-encoder embeds the query and every chunk SEPARATELY, so chunks can be pre-indexed and a shelf of ~50 candidates handed back cheaply — but coarsely. Stage 2 (rerank) is a careful editor: a cross-encoder reads each (query, chunk) PAIR together and reorders by true relevance, keeping the best ~5. Cheap wide recall, then expensive narrow precision — only on the few that survived the first stage.

Theory

In RAG, the generator is only as good as what it's given, so retrieval quality is the real bottleneck. Two largely independent levers control it: how documents are CHUNKED (what units can be retrieved) and whether a RERANKER refines the ordering before chunks reach the prompt.

Chunking strategy sets a granularity trade-off. Chunks that are too LARGE produce an embedding that averages many topics, diluting relevance and wasting context window; chunks that are too SMALL lose the surrounding context needed to actually answer. Sentence and fixed-window splits are simple baselines; OVERLAP repeats the tail of each chunk at the head of the next, so an answer that straddles a boundary still appears intact in at least one chunk, as long as the answer span fits within the overlap; semantic chunking splits on topic shifts.
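The fixed-window and overlap strategies can be sketched as follows. This is a minimal illustration, assuming the document is already tokenized into a list of whitespace-separated words; the function name chunk_with_overlap and its size and overlap parameters are made up for this example and do not come from any particular library.

```python
def chunk_with_overlap(tokens, size, overlap):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    stride = size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


tokens = "to reset your password open settings then choose security".split()
no_overlap = chunk_with_overlap(tokens, size=4, overlap=0)
with_overlap = chunk_with_overlap(tokens, size=4, overlap=2)
```

With overlap=0 the phrase "password open settings" is cut across the first two chunks; with overlap=2 it sits whole inside the second chunk, at the cost of one extra chunk.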

The retrieval/rerank split comes from two model architectures. A BI-ENCODER encodes query and chunk independently into vectors and compares them by cosine similarity. That makes it fast and pre-indexable (you embed the whole corpus once), but coarse, because the two texts never "see" each other. A CROSS-ENCODER feeds the (query, chunk) pair through the model together, attending across both. That is far more precise, but it must run once per candidate at query time, which is impractically slow over a whole corpus.
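The bi-encoder side of this split can be illustrated without a real model. In the sketch below, toy_embed is a hypothetical bag-of-words stand-in for a learned encoder (an assumption made so the example runs with the standard library only); the point is the shape of the workflow: chunk vectors are built once, the query is embedded separately, and the two only meet in the cosine.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for a bi-encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


chunks = [
    "reset your password from the security page",
    "billing invoices are emailed monthly",
]
index = [toy_embed(c) for c in chunks]  # built once, offline

query_vec = toy_embed("how do I reset my password")  # embedded alone
scores = [cosine(query_vec, v) for v in index]
```

A cross-encoder has no equivalent of index: it needs the query before it can compute anything, which is why it cannot be precomputed over the corpus.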

So production retrieval is two-stage: use the cheap bi-encoder (often over an ANN index) to RECALL ~50 candidates, then the expensive cross-encoder to RERANK them down to the best ~5. You get most of the cross-encoder's precision at a tiny fraction of the cost, because it only ever scores a small candidate set. The reranker can only reorder what stage 1 returned, so a relevant chunk that the recall stage misses cannot be recovered.
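The recall-then-rerank funnel can be sketched as a function that takes the two scorers as arguments. Everything here is illustrative: two_stage_search, density_score, and coverage_score are hypothetical names, and the two toy scorers are simple word-overlap heuristics standing in for a bi-encoder and a cross-encoder, not real models.

```python
def two_stage_search(query, chunks, retrieve_score, rerank_score,
                     recall_k=50, final_k=5):
    """Cheap scorer over every chunk, expensive scorer over the survivors only."""
    # Stage 1: rank the whole corpus with the cheap score, keep recall_k.
    candidates = sorted(chunks, key=lambda c: retrieve_score(query, c),
                        reverse=True)[:recall_k]
    # Stage 2: re-score only those candidates with the expensive score.
    reranked = sorted(candidates, key=lambda c: rerank_score(query, c),
                      reverse=True)
    return reranked[:final_k]


def density_score(query, chunk):
    # Toy stage-1 score: share of chunk words that also occur in the query.
    q = set(query.lower().split())
    words = chunk.lower().split()
    return sum(w in q for w in words) / len(words) if words else 0.0


def coverage_score(query, chunk):
    # Toy stage-2 score: share of distinct query words the chunk covers.
    q = set(query.lower().split())
    return len(q & set(chunk.lower().split())) / len(q) if q else 0.0


chunks = [
    "password password password policy overview",
    "to reset your password open settings and choose reset",
    "billing invoices are emailed monthly",
]
query = "how do I reset my password"
stage1_top = max(chunks, key=lambda c: density_score(query, c))
final = two_stage_search(query, chunks, density_score, coverage_score,
                         recall_k=2, final_k=1)
```

In this toy corpus the keyword-heavy first chunk wins stage 1, while the rerank step promotes the chunk that covers both "reset" and "password".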

Practical guidance: tune chunk size and overlap to your content rather than defaulting to fixed 512-token blocks; don't feed raw vector hits straight to the LLM when precision matters; and know the middle-ground options like ColBERT (late interaction) that trade some cost for token-level matching between query and chunk.
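Late interaction can be made concrete with the scoring rule ColBERT is known for: each query token takes its best-matching chunk token, and those maxima are summed. The sketch below assumes token embeddings are already available and uses hand-written 2-d vectors purely for illustration; maxsim_score is a name chosen for this example, and a real system would use learned, normalized embeddings.

```python
def maxsim_score(query_vecs, chunk_vecs):
    """Sum over query tokens of the best dot product with any chunk token."""
    total = 0.0
    for q in query_vecs:
        total += max(sum(qi * ci for qi, ci in zip(q, c)) for c in chunk_vecs)
    return total


# Toy token embeddings (assumed, not produced by any model).
query_vecs = [(1.0, 0.0), (0.0, 1.0)]
chunk_a = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]  # matches both query tokens
chunk_b = [(1.0, 0.0), (1.0, 0.0)]              # matches only the first

score_a = maxsim_score(query_vecs, chunk_a)
score_b = maxsim_score(query_vecs, chunk_b)
```

Chunk token vectors can still be indexed ahead of time, as with a bi-encoder, while the per-token matching recovers some of the precision a single pooled vector loses.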

Concrete example

A support bot's vector search ranks a chunk that merely mentions "password" highest by cosine, while the chunk that actually explains the reset steps ranks 4th. A cross-encoder reranker reads the (query, chunk) pairs and promotes the truly relevant chunk to #1 before it goes into the prompt.

Key equations

chunking: split doc → chunks (sentence | fixed-window | fixed+overlap)
retrieval score: cosine(query, chunk) — fast, length-normalized (bi-encoder)
rerank score: relevance of the (query, chunk) PAIR (cross-encoder)
final order = sort by rerank score, keep top-k for the prompt
two-stage: cheap recall (vector) → precise reorder (rerank)
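
Written out, the retrieval score above is the standard cosine similarity between a query embedding q and a chunk embedding c (the symbols are chosen here for illustration). The small worked case shows what length-normalized means: scaling a vector does not change its score.

```latex
\cos(\mathbf{q}, \mathbf{c}) = \frac{\mathbf{q} \cdot \mathbf{c}}{\lVert \mathbf{q} \rVert \, \lVert \mathbf{c} \rVert}

% Worked case with c = 2q
\mathbf{q} = (1, 2), \quad \mathbf{c} = (2, 4): \qquad
\cos(\mathbf{q}, \mathbf{c}) = \frac{1 \cdot 2 + 2 \cdot 4}{\sqrt{5}\,\sqrt{20}} = \frac{10}{10} = 1
```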

Step by step

Pick a chunking strategy — see how the document splits.
Vector retrieval scores each chunk by cosine to the query (stage 1).
The reranker re-scores by how fully each chunk covers the query (stage 2).
Watch chunks move up/down between the two rankings.
The reranked top-k is what actually goes into the RAG prompt.

Interview questions & answers

Why add a reranker if vector search already returns top-k?

Bi-encoder vector search embeds query and chunk separately, so it’s fast but coarse. A cross-encoder reranker processes the (query, chunk) pair jointly for much higher precision — you retrieve ~50 cheaply, then rerank to the best ~5.

How does chunk size affect retrieval?

Too large: the embedding averages many topics, diluting relevance and wasting context. Too small: chunks lose the surrounding context needed to answer. Overlap preserves continuity across boundaries.

When is overlapping chunking worth the cost?

When answers straddle chunk boundaries. Overlap lets the full answer appear intact in at least one chunk, as long as the answer fits within the overlap, at the price of more chunks to store and search.
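
The storage side of that trade-off is simple arithmetic. A sketch, assuming fixed-size windows where only the last one may be short; chunk_count is a hypothetical helper, and the 10,000-token document and 512-token window are illustrative numbers only.

```python
import math


def chunk_count(n_tokens, size, overlap):
    """Number of fixed-size windows needed to cover n_tokens with overlap."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    if n_tokens <= 0:
        return 0
    if n_tokens <= size:
        return 1
    stride = size - overlap  # new tokens contributed by each extra window
    return math.ceil((n_tokens - size) / stride) + 1


counts = {ov: chunk_count(10_000, size=512, overlap=ov) for ov in (0, 128, 256)}
```

Here the chunk count grows from 20 with no overlap to 26 at 128 tokens of overlap and 39 at 256, so the index grows roughly in proportion to size / (size - overlap).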

What’s the cost trade-off of reranking?

Cross-encoders are slow (they run the model per candidate), so you only rerank a small candidate set from vector search — recall stage cheap, precision stage targeted.
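
A back-of-the-envelope calculation shows why the candidate set has to stay small. The figures below are hypothetical round numbers (5 ms per cross-encoder pair, a one-million-chunk corpus, 50 candidates), chosen only to show the scale of the gap, not measured from any system.

```python
def rerank_latency_ms(num_pairs, ms_per_pair):
    """Sequential cross-encoder time: one forward pass per (query, chunk) pair."""
    return num_pairs * ms_per_pair


MS_PER_PAIR = 5          # assumed cost of one cross-encoder pass
CORPUS_CHUNKS = 1_000_000
CANDIDATES = 50

full_corpus_ms = rerank_latency_ms(CORPUS_CHUNKS, MS_PER_PAIR)
two_stage_ms = rerank_latency_ms(CANDIDATES, MS_PER_PAIR)
speedup = full_corpus_ms // two_stage_ms
```

Under these assumptions, scoring every chunk would take about 5,000 seconds per query, while reranking 50 candidates takes 250 ms, before counting the (much smaller) cost of the vector search itself.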

Common pitfalls

Blindly using fixed 512-token chunks for every corpus — tune to your content.
Skipping the reranker and feeding raw vector hits to the LLM when precision matters.
No overlap, so answers split across boundaries are never retrieved whole.

Where it shows up

Production RAG retrieval pipelines
Rerankers: Cohere Rerank and BGE-reranker (cross-encoders); ColBERT (late interaction, not a cross-encoder)
LangChain / LlamaIndex retrievers + node post-processors
