📚 Design RAG System — System Design Interview Guide

Hard · AI & ML Systems

Design a Retrieval-Augmented Generation system: ground an LLM's answers in your own corpus so it cites real sources and avoids hallucination — e.g. an enterprise knowledge assistant.

Functional requirements

Answer questions using retrieved context from a private corpus
Cite the source chunks used to generate each answer
Keep the index fresh as source documents change
Respect per-user document permissions at retrieval time
Evaluate answer quality offline (faithfulness, relevance)

Non-functional requirements & scale

50M document chunks across 2M source documents
End-to-end time to first token (TTFT) p95 < 2s (query embedding + retrieval + re-rank + first generated token)
Answer faithfulness (grounded-in-context) > 95%
Index freshness < 5 min after a document update
No data leakage across permission boundaries

Capacity estimation

RAG = retrieval (the Semantic Search problem) + generation (the Chatbot problem) wired together with careful prompt assembly and evaluation. Latency budget splits across embed → ANN search → re-rank → LLM generate. The defining challenges are grounding/faithfulness, permissions, and keeping the index in sync with the source of truth.
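
As a rough illustration of the sizing this implies, the sketch below works through the vector-storage arithmetic and a sample latency split. The chunk and document counts and the 2s target come from the requirements above; the embedding dimension, float width, and every per-stage latency figure are assumptions chosen for illustration, not measurements.

```python
# Back-of-envelope sizing for the numbers in the requirements.
CHUNKS = 50_000_000        # from the requirements
DOCS = 2_000_000           # from the requirements
EMBED_DIM = 1024           # ASSUMPTION: embedding dimension
BYTES_PER_FLOAT = 4        # ASSUMPTION: float32 vectors, no quantization

chunks_per_doc = CHUNKS / DOCS
raw_vector_gb = CHUNKS * EMBED_DIM * BYTES_PER_FLOAT / 1e9

# ASSUMPTION: placeholder per-stage budgets in milliseconds (not measured).
TTFT_TARGET_MS = 2000      # from the requirements (p95 < 2s)
stage_budget_ms = {
    "embed_query": 50,
    "ann_search": 150,
    "re_rank": 300,
    "llm_first_token": 1000,
}
total_ms = sum(stage_budget_ms.values())
headroom_ms = TTFT_TARGET_MS - total_ms

print(f"avg chunks per document: {chunks_per_doc:.0f}")
print(f"raw vector storage: {raw_vector_gb:.1f} GB (before index overhead)")
print(f"planned TTFT: {total_ms} ms, headroom: {headroom_ms} ms")
```

Under these assumptions the raw vectors alone come to roughly 200 GB, which is why the index is usually sharded or quantized rather than held on a single node.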

Core entities

Document — docId, source, version, updatedAt, aclGroups[]
Chunk — chunkId, docId, text, embedding, metadata
Query — queryText, userId, retrievedChunkIds[], filters
Answer — answerId, query, generatedText, citations[], faithfulnessScore
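
One possible way to express these entities in code is shown below. The field names follow the list above; the concrete types (an integer version, string IDs, a plain dict for metadata and filters) are assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class Document:
    doc_id: str
    source: str
    version: int                      # ASSUMPTION: monotonically increasing integer
    updated_at: datetime
    acl_groups: list[str] = field(default_factory=list)


@dataclass
class Chunk:
    chunk_id: str
    doc_id: str
    text: str
    embedding: list[float]
    metadata: dict = field(default_factory=dict)   # e.g. a copy of the doc's aclGroups


@dataclass
class Query:
    query_text: str
    user_id: str
    retrieved_chunk_ids: list[str] = field(default_factory=list)
    filters: dict = field(default_factory=dict)


@dataclass
class Answer:
    answer_id: str
    query: Query
    generated_text: str
    citations: list[str] = field(default_factory=list)   # chunk IDs backing the answer
    faithfulness_score: Optional[float] = None            # filled in by the async eval job
```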

API design

POST /api/v1/ask — Body: { query, conversationId? }. Returns a grounded, streamed answer with citations.
POST /api/v1/ingest — Ingest/refresh a source document into the index (async).
GET /api/v1/answers/:id/citations — Return the source chunks backing an answer.

High-level design

A CDC/ingest pipeline chunks and embeds source documents into the Vector DB. On a question, the RAG Orchestrator embeds the query, retrieves candidate chunks (filtered by the user's permissions), re-ranks them, assembles a grounded prompt (instructions + context + question), and calls the LLM with streaming. Citations map generated claims back to retrieved chunks. An async eval job scores faithfulness.
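
The query path described above can be sketched as a thin orchestrator that only sequences the stages. Every callable here (embed, retrieve, rerank, build_prompt, generate) is a hypothetical interface standing in for a real service; none of these names come from a specific library.

```python
from typing import Callable, Iterator, Sequence


def answer_question(
    query: str,
    user_groups: Sequence[str],
    embed: Callable[[str], list[float]],
    retrieve: Callable[[list[float], Sequence[str]], list[dict]],
    rerank: Callable[[str, list[dict]], list[dict]],
    build_prompt: Callable[[str, list[dict]], str],
    generate: Callable[[str], Iterator[str]],
) -> Iterator[str]:
    """Embed -> permission-filtered retrieve -> re-rank -> prompt -> stream tokens."""
    query_vector = embed(query)
    # Permissions are passed INTO retrieval, not applied afterwards.
    candidates = retrieve(query_vector, user_groups)
    top_chunks = rerank(query, candidates)
    prompt = build_prompt(query, top_chunks)
    # Stream tokens to the caller as the LLM produces them.
    yield from generate(prompt)
```

Keeping the orchestrator free of model or database logic is what lets the retrieval, re-rank, and generation stages be deployed and scaled as separate services.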

Deep dives

🧩 Prompt Assembly & Grounding

Build the prompt as: system instructions ("answer ONLY from the context; if unknown, say so") + the top-K retrieved chunks (with source IDs) + the user question. Fit it to the context budget — too many chunks dilute attention and raise cost; too few hurt recall. Force the model to cite chunk IDs so claims are traceable. This grounding is what separates RAG from a bare chatbot.
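
A minimal sketch of that assembly step follows, assuming chunks arrive already ranked best-first. The word-count tokenizer is a deliberate simplification (a real system would use the model's own tokenizer), and the instruction wording and the bracketed citation format are assumptions.

```python
SYSTEM_INSTRUCTIONS = (
    "Answer ONLY from the context below. If the answer is not in the "
    "context, say you do not know. Cite the chunk ID in square brackets "
    "after each claim."
)


def rough_token_count(text: str) -> int:
    """ASSUMPTION: crude stand-in for a real tokenizer (counts words)."""
    return len(text.split())


def build_prompt(question: str, chunks: list[tuple[str, str]], context_budget: int) -> str:
    """Assemble instructions + context + question.

    chunks is a ranked list of (chunk_id, text), best first.
    context_budget caps the (approximate) tokens spent on context.
    """
    context_lines = []
    used = 0
    for chunk_id, text in chunks:
        cost = rough_token_count(text)
        if used + cost > context_budget:
            break  # stop at the first chunk that does not fit, preserving rank order
        context_lines.append(f"[{chunk_id}] {text}")  # source ID travels with the text
        used += cost
    context = "\n".join(context_lines)
    return f"{SYSTEM_INSTRUCTIONS}\n\nContext:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    ranked = [("c1", "VPN access requires MFA."), ("c2", "Laptops are refreshed every three years.")]
    print(build_prompt("How do I get VPN access?", ranked, context_budget=10))
```

Because every context line is prefixed with its chunk ID, the citations in the generated answer can be mapped back to the retrieved chunks.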

🎯 Retrieval Quality & Re-ranking

First-stage ANN retrieves ~50–100 candidates cheaply (high recall, loose precision). A cross-encoder re-ranker then scores (query, chunk) pairs jointly and keeps the top ~5–8 for the prompt. This two-stage funnel is the biggest lever on answer quality. For vague questions, add query rewriting or HyDE (hypothetical document embeddings: embed a model-written hypothetical answer instead of the raw question); for recall, add multi-query expansion.
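
The funnel can be illustrated with a toy version of both stages. Two substitutions are made to keep the sketch self-contained: the first stage is an exact brute-force cosine search standing in for ANN, and the re-ranker takes any score(query, text) function, with a word-overlap scorer standing in for a cross-encoder. All function names are hypothetical.

```python
import math
from typing import Callable


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def first_stage(query_vec: list[float], index: dict[str, list[float]], k: int) -> list[str]:
    """Cheap, recall-oriented stage. Brute force here; a real system would use ANN."""
    ranked = sorted(index, key=lambda cid: cosine(query_vec, index[cid]), reverse=True)
    return ranked[:k]


def rerank(
    query: str,
    candidate_ids: list[str],
    texts: dict[str, str],
    score: Callable[[str, str], float],
    keep: int,
) -> list[str]:
    """Precision-oriented stage: score each (query, chunk) pair jointly, keep the best."""
    ranked = sorted(candidate_ids, key=lambda cid: score(query, texts[cid]), reverse=True)
    return ranked[:keep]


def overlap_score(query: str, text: str) -> float:
    """ASSUMPTION: toy scorer standing in for a cross-encoder model."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


if __name__ == "__main__":
    index = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
    texts = {"a": "cats sleep", "b": "reset your password here", "c": "lunch menu"}
    candidates = first_stage([1.0, 0.0], index, k=2)
    print(rerank("password reset", candidates, texts, overlap_score, keep=1))
```

In the toy data the vector closest to the query is not the most relevant chunk, and the second stage corrects the order, which is the point of re-ranking.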

🔄 Index Freshness (CDC)

Stale context produces confidently wrong answers. Hook source systems via change-data-capture: on update, re-chunk and re-embed only the changed document and upsert (with version + tombstone for deletes). Track per-doc version so retrieval never mixes old and new chunks of the same document.
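
A minimal in-memory sketch of versioned upserts with tombstones is shown below. The ChunkIndex class and its methods are hypothetical, and a real Vector DB would need an equivalent atomic swap per document. The key properties are that all chunks of a document are replaced together, and that a tombstone keeps its version so a late-arriving older event cannot resurrect deleted content.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class DocIndexState:
    version: int
    chunks: list[str]
    deleted: bool = False


class ChunkIndex:
    """Hypothetical per-document index state keyed by doc ID."""

    def __init__(self) -> None:
        self._docs: dict[str, DocIndexState] = {}

    def apply_change(
        self, doc_id: str, version: int, chunks: Sequence[str] = (), deleted: bool = False
    ) -> bool:
        """Apply one CDC event. Returns False if the event is stale or a duplicate."""
        current = self._docs.get(doc_id)
        if current is not None and version <= current.version:
            return False  # out-of-order or replayed event: ignore
        # Replace the whole chunk set in one step so versions never mix.
        self._docs[doc_id] = DocIndexState(version, [] if deleted else list(chunks), deleted)
        return True

    def live_chunks(self, doc_id: str) -> list[str]:
        state = self._docs.get(doc_id)
        if state is None or state.deleted:
            return []
        return list(state.chunks)
```

For example, applying version 1, then a delete at version 3, then a delayed version 2 update leaves the document deleted, because the version 2 event is rejected.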

📏 Evaluation & Hallucination Guarding

You cannot improve what you cannot measure. Maintain a golden Q/A set and score faithfulness (is every claim supported by retrieved context?), answer relevance, and context precision/recall — often with an LLM-as-judge plus human spot-checks. Gate releases on these metrics. At serve time, optionally verify the answer against its citations before returning.
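
The metrics can be made concrete with a small sketch. It assumes the golden set labels which chunk IDs are relevant to each question, and that a judge (LLM or human) has already marked each claim in an answer as supported or not. The function names are hypothetical, and the 0.95 threshold mirrors the faithfulness target in the requirements.

```python
def context_precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    """Precision: share of retrieved chunks that are relevant.
    Recall: share of relevant chunks that were retrieved."""
    hits = sum(1 for chunk_id in retrieved if chunk_id in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def faithfulness(claim_supported: list[bool]) -> float:
    """Fraction of claims in one answer that the retrieved context supports."""
    return sum(claim_supported) / len(claim_supported) if claim_supported else 0.0


def passes_release_gate(per_answer_scores: list[float], threshold: float = 0.95) -> bool:
    """Gate a release on mean faithfulness over the golden set."""
    if not per_answer_scores:
        return False  # no evidence is treated as a failed gate
    return sum(per_answer_scores) / len(per_answer_scores) > threshold
```

Tracking retrieval metrics separately from faithfulness helps localize a regression: low context recall points at retrieval, while low faithfulness with good context points at prompting or the model.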

🔐 Permission-Aware Retrieval

Filter candidates by the requesting user's ACL groups DURING retrieval, never retrieve-then-filter: post-filtering can starve results (the top-K may consist mostly of chunks the user cannot see) and can leak the existence of restricted documents. Store aclGroups on each chunk's metadata and pass them as a hard pre-filter to the Vector DB so a user can only be grounded in documents they may see.
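
The starvation problem is easy to see in a toy comparison. The sketch assumes a chunk is visible when it shares at least one group with the user, and the similarity scores are given rather than computed; the function names are hypothetical.

```python
ScoredChunk = tuple[str, float, set[str]]  # (chunk_id, similarity score, acl_groups)


def visible(chunk_groups: set[str], user_groups: set[str]) -> bool:
    """ASSUMPTION: any shared group grants access."""
    return bool(chunk_groups & user_groups)


def search_prefilter(chunks: list[ScoredChunk], user_groups: set[str], k: int) -> list[str]:
    """Restrict to visible chunks first, then take the top k."""
    allowed = [c for c in chunks if visible(c[2], user_groups)]
    allowed.sort(key=lambda c: c[1], reverse=True)
    return [c[0] for c in allowed[:k]]


def search_postfilter(chunks: list[ScoredChunk], user_groups: set[str], k: int) -> list[str]:
    """Anti-pattern: take the top k first, then drop what the user cannot see."""
    top = sorted(chunks, key=lambda c: c[1], reverse=True)[:k]
    return [c[0] for c in top if visible(c[2], user_groups)]


if __name__ == "__main__":
    corpus = [
        ("hr-1", 0.9, {"hr"}),
        ("hr-2", 0.8, {"hr"}),
        ("eng-1", 0.7, {"eng"}),
        ("eng-2", 0.6, {"eng"}),
    ]
    print(search_prefilter(corpus, {"eng"}, k=2))   # two usable chunks
    print(search_postfilter(corpus, {"eng"}, k=2))  # nothing survives the filter
```

With post-filtering, the engineering user gets an empty context even though two permitted chunks exist, because restricted chunks crowded them out of the top k.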

Scaling considerations

Decompose into Retrieval, Re-rank, and Generation services that scale independently
Cache (query → retrieved chunks), scoped to the user's permission groups so cached results never cross permission boundaries, and reuse across similar questions
Re-ranker and embedder are GPU-bound — batch and autoscale them separately
Version indexes so a model/embedding upgrade is a safe blue-green rebuild
Stream the final answer; retrieval + re-rank happen before first token
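
For the retrieval cache in the list above, one detail worth sketching is the cache key. The version below is a minimal sketch under stated assumptions: it folds the user's groups and an index version into the key, so cached chunks are never served across permission boundaries or across an index rebuild. The cache_key function is hypothetical, and the normalization only catches trivially similar questions (case and whitespace); matching paraphrases would need an embedding-based lookup instead.

```python
import hashlib
from typing import Iterable


def cache_key(query: str, user_groups: Iterable[str], index_version: str) -> str:
    """Build a deterministic key for the (query -> retrieved chunks) cache.

    ASSUMPTION: group names and index_version contain neither ',' nor '|',
    so joining them cannot produce colliding keys.
    """
    normalized = " ".join(query.lower().split())        # fold case and whitespace
    groups = ",".join(sorted(set(user_groups)))         # order-independent ACL scope
    raw = f"{index_version}|{groups}|{normalized}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    a = cache_key("Reset  Password", ["eng", "hr"], "v1")
    b = cache_key("reset password", ["hr", "eng"], "v1")
    c = cache_key("reset password", ["eng"], "v1")
    print(a == b)  # same scope, same normalized query
    print(a == c)  # different permission scope
```

Including the index version means a blue-green rebuild naturally starts with a cold cache instead of serving chunks from the old index.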

What interviewers expect by level

Junior: Explain why RAG reduces hallucination and the retrieve→augment→generate flow.
Mid: Design chunking + embedding + ANN retrieval, prompt assembly with citations, basic freshness.
Senior: Two-stage retrieval with cross-encoder re-ranking, permission-aware pre-filtering, CDC index updates, caching.
Staff: Faithfulness evaluation harness + release gating, index versioning/blue-green re-embedding, multi-service capacity planning, cost/quality trade-offs.
