🧠 RAG (Retrieval-Augmented Generation) Pipeline — AI / ML Interview Guide

What it is

An LLM only knows what was in its training data. RAG gives it fresh, specific knowledge at query time: embed the question, retrieve the most similar chunks from your documents, paste them into the prompt as context, and let the model answer grounded in those chunks instead of guessing.

Mental model

Closed-book vs open-book exam. A raw LLM answers from memory and bluffs when it doesn't know (hallucinates). RAG turns it into an open-book test: BEFORE answering, fetch the most relevant pages from YOUR documents and put them on the desk. The model now reads instead of recalling — answers are grounded and citable, and you can swap the "book" any time by re-indexing, with no retraining.

Theory

RAG exists because a model's weights are a frozen, lossy snapshot of its training data: it has a knowledge cutoff, cannot see your private data, and confidently hallucinates when asked beyond what it learned. Retrieval-Augmented Generation addresses all three by injecting relevant external knowledge into the prompt at inference time, so the answer is conditioned on retrieved facts rather than on parametric memory alone. It reduces hallucination rather than eliminating it, since the model can still misread or ignore the context it is given.

The architecture has two phases. OFFLINE (indexing): split documents into chunks, embed each chunk with an embedding model, and store the vectors in a vector DB. ONLINE (querying): embed the user question, retrieve the top-k most similar chunks (cosine similarity, usually via an ANN index like HNSW), stitch those chunks + the question into a prompt, and let the LLM generate an answer grounded in them.
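
The two phases can be sketched in a few lines of Python. This is a minimal illustration, not a production design: embed() here is a toy bag-of-words counter standing in for a real embedding model, the "vector DB" is a plain list, and retrieval is a brute-force scan rather than an ANN index. All function names are assumptions made for this example.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """OFFLINE: embed every chunk once and store (chunk, vector) pairs."""
    return [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(index: list[tuple[str, Counter]], question: str, k: int = 2) -> list[str]:
    """ONLINE: embed the question, score every chunk, keep the top-k."""
    q_vec = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

chunks = [
    "To reset your password, open Settings and choose Reset password.",
    "Invoices are emailed on the first day of each month.",
    "Two-factor authentication can be enabled under Security.",
]
index = build_index(chunks)  # done once, ahead of time
top = retrieve(index, "how do I reset my password?", k=1)  # done per query
```

In a real system only embed() and the storage layer change: the index is built once and refreshed when documents change, while retrieve() runs on every query.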

RAG vs fine-tuning is a classic interview axis. RAG adds KNOWLEDGE at inference — cheap to update (re-index), auditable (you can show sources), and easy to scope per user/tenant. Fine-tuning changes BEHAVIOR/style and bakes knowledge into weights — powerful for tone and format, but costly to update and hard to attribute. They are complementary, not either/or.

Quality is dominated by RETRIEVAL, not the LLM: "garbage in, garbage out". If the right chunk is never retrieved, no model can answer correctly. So the levers are chunking strategy, the embedding model, k, hybrid (keyword + vector) search, and a reranker to fix ordering. When a RAG system is wrong, inspect the retrieved chunks first.
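
One common way to combine a keyword ranking with a vector ranking is reciprocal rank fusion (RRF). The text above does not prescribe a fusion method, so treat this as one possible sketch: the document IDs are made up, and c = 60 is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], c: int = 60) -> list[str]:
    """Fuse several ranked lists: each list adds 1 / (c + rank) to a document's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (c + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda d: scores[d], reverse=True)

# Hypothetical results from a keyword search and a vector search.
keyword_ranking = ["doc_b", "doc_a", "doc_c"]
vector_ranking = ["doc_a", "doc_d", "doc_b"]

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
```

RRF uses only rank positions, so it needs no score normalization between the two retrievers. A document that both retrievers rank well (doc_a here) rises above one that only a single retriever found.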

The known failure modes shape advanced designs. Stuffing too many chunks costs tokens/latency and triggers "lost in the middle" (models under-use middle context). A stale index means stale answers. Missing source attribution means users can't verify and you can't debug. These drive query rewriting, reranking, hybrid search, and agentic RAG (the model iteratively decides what to retrieve).

Concrete example

A support bot for your product: docs are chunked and embedded into a vector DB. A user asks "how do I reset my password?" — you embed the question, retrieve the top-k matching help-doc chunks, and prompt the LLM with them, so the answer cites YOUR docs instead of hallucinating.
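
The prompt for that support bot might be assembled like this. It is a sketch under assumptions: the file paths, chunk texts, and instruction wording are invented for illustration, and the retrieved chunks are passed in as plain dicts.

```python
def build_grounded_prompt(question: str, chunks: list[dict]) -> str:
    """Number each retrieved chunk so the model can cite it as [1], [2], ..."""
    sources = "\n".join(
        f"[{i}] ({c['source']}) {c['text']}" for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the sources below. "
        "Cite sources by their bracketed number. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Hypothetical top-k chunks returned by the retriever.
retrieved = [
    {"source": "help/account.md", "text": "Open Settings, then choose Reset password."},
    {"source": "help/login.md", "text": "A reset link is emailed to your address."},
]
prompt = build_grounded_prompt("how do I reset my password?", retrieved)
```

Keeping the source path next to each chunk is what makes the answer citable: the application can map a cited [1] back to help/account.md and show it to the user.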

Key equations

offline: chunk docs → embed each chunk → store vectors in a vector DB
query: q_vec = embed(question)
retrieve: top-k chunks by cosine(q_vec, embed(chunkᵢ))
augment: prompt = context(top-k chunks) + question
generate: answer = LLM(prompt) — grounded in retrieved context
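
The retrieve step depends on one formula, cosine similarity. As notation for this sketch, q is the question vector (q_vec above) and c_i is the embedding of chunk i. The three-dimensional vectors in the worked example are made up for arithmetic convenience.

```latex
\[
\operatorname{sim}(\mathbf{q}, \mathbf{c}_i)
  = \frac{\mathbf{q} \cdot \mathbf{c}_i}{\lVert \mathbf{q} \rVert \, \lVert \mathbf{c}_i \rVert}
\]

% Worked example with q = (1, 0, 1), c_1 = (1, 1, 0), c_2 = (1, 0, 1):
\[
\operatorname{sim}(\mathbf{q}, \mathbf{c}_1) = \frac{1}{\sqrt{2}\,\sqrt{2}} = 0.5,
\qquad
\operatorname{sim}(\mathbf{q}, \mathbf{c}_2) = \frac{2}{\sqrt{2}\,\sqrt{2}} = 1
\]
```

With k = 1, chunk 2 is retrieved. Cosine similarity depends only on direction, not vector length, so a long chunk gains no advantage from having a larger-magnitude embedding.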

Step by step

Query: the user's question enters the pipeline.
Embed: the question becomes a vector.
Retrieve: score the chunks by similarity to the question vector; keep the top-k.
Augment: stitch the kept chunks + question into a prompt.
Generate: the LLM answers using that context. Changing the query or top-k changes which chunks are retrieved, and therefore the answer.

Interview questions & answers

Why use RAG instead of fine-tuning?

RAG injects knowledge at inference time — cheap to update (just re-index docs), auditable (you see the sources), and avoids retraining. Fine-tuning changes behavior/style and bakes knowledge in, but is costly to update and harder to attribute.

Your RAG gives wrong answers — where do you look?

Usually retrieval, not the LLM: bad chunking (too big/small), a weak embedding model, a poorly chosen k, or no reranker. Inspect the retrieved chunks first; if the right context isn't retrieved, the model can't answer correctly.

What is chunking and why does it matter?

Splitting documents into retrievable units. Too large → noisy, dilutes the embedding; too small → loses context. Overlapping, semantically-aware chunks retrieve better.
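
A minimal sketch of overlapping chunks, assuming fixed-size word windows. This is the simplest variant; a semantically-aware splitter would cut on sentence or section boundaries instead. The function name and default sizes are assumptions for illustration.

```python
def chunk_words(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into windows of `size` words, each sharing `overlap` words with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks

text = " ".join(f"w{i}" for i in range(10))  # "w0 w1 ... w9"
chunks = chunk_words(text, size=4, overlap=2)
# chunks == ["w0 w1 w2 w3", "w2 w3 w4 w5", "w4 w5 w6 w7", "w6 w7 w8 w9"]
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing and embedding some words twice.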

What does a reranker add?

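Vector search is fast but approximate: the query and each chunk are embedded separately, so relevance is judged only through the similarity of two independent vectors. A cross-encoder reranker re-scores the top candidates by reading the query and chunk together, which gives higher precision, and reorders them before they reach the prompt.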
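
The two-stage pattern looks like this in outline. toy_pair_score is a hypothetical stand-in for a cross-encoder (it only counts shared words); in practice score_pair would call a real model that reads the query and chunk together. The candidate texts are invented for the example.

```python
from typing import Callable

def rerank(
    query: str,
    candidates: list[str],
    score_pair: Callable[[str, str], float],
    top_n: int = 3,
) -> list[str]:
    """Re-score each (query, chunk) pair jointly and keep the best top_n."""
    scored = [(score_pair(query, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]

def toy_pair_score(query: str, chunk: str) -> float:
    """Hypothetical stand-in for a cross-encoder: fraction of query words found in the chunk."""
    q_words = set(query.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / len(q_words) if q_words else 0.0

# Candidates as they might come back from first-stage vector search.
candidates = [
    "billing cycles and invoices",
    "reset your password from the settings page",
    "password rules for new accounts",
]
best = rerank("reset password", candidates, toy_pair_score, top_n=2)
```

The reranker only sees the handful of candidates that first-stage retrieval returned, which is why an expensive pairwise model is affordable here but not across the whole corpus.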

Common pitfalls

Treating bad answers as an LLM problem when retrieval is the real failure.
Stuffing too many chunks into context — cost, latency, and "lost in the middle".
No source attribution — users can’t verify, and you can’t debug.
Stale index — RAG is only as fresh as your last re-embedding run.
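
For the stale-index pitfall, one possible guard is to store a content hash next to each indexed document and re-embed only what changed. This is a sketch under assumptions: the document IDs, the hash store, and the plan_reindex helper are all hypothetical, and a real pipeline would track hashes per chunk rather than per document.

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a document's current text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_reindex(
    docs: dict[str, str], indexed_hashes: dict[str, str]
) -> tuple[list[str], list[str]]:
    """Return (ids to embed again, ids to delete from the index)."""
    to_embed = sorted(
        doc_id for doc_id, text in docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)  # new or changed
    )
    to_delete = sorted(doc_id for doc_id in indexed_hashes if doc_id not in docs)
    return to_embed, to_delete

# Hashes recorded at the last indexing run (hypothetical).
indexed = {
    "faq": content_hash("old faq"),
    "pricing": content_hash("plans"),
    "legacy": content_hash("retired page"),
}
# Documents as they exist now.
docs = {"faq": "new faq", "pricing": "plans", "setup": "install steps"}

to_embed, to_delete = plan_reindex(docs, indexed)
# to_embed == ["faq", "setup"], to_delete == ["legacy"]
```

Deleting removed documents matters as much as embedding new ones: a chunk left in the index after its source was retired will still be retrieved and cited.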

Where it shows up

Doc Q&A / support bots / enterprise search
Vector DBs: Pinecone, Weaviate, pgvector, FAISS
LangChain / LlamaIndex retrieval chains
