📚 Design RAG System — System Design Interview Guide

Hard · AI & ML Systems

Design a Retrieval-Augmented Generation system: ground an LLM's answers in your own corpus so it cites real sources and avoids hallucination — e.g. an enterprise knowledge assistant.


Functional requirements

Non-functional requirements & scale

Capacity estimation

RAG combines retrieval (the Semantic Search problem) with generation (the Chatbot problem), wired together by careful prompt assembly and evaluation. The end-to-end latency budget is split across four stages: embed → ANN search → re-rank → LLM generate. The defining challenges are grounding/faithfulness, permission enforcement, and keeping the index in sync with the source of truth.

Core entities

API design

High-level design

A change-data-capture (CDC) ingest pipeline chunks and embeds source documents into the Vector DB. On a question, the RAG Orchestrator embeds the query, retrieves candidate chunks (filtered by the user's permissions), re-ranks them, assembles a grounded prompt (instructions + context + question), and streams the LLM's response back to the caller. Citations map generated claims back to the retrieved chunks that support them. An async eval job scores faithfulness.
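The query path described above can be sketched as a thin orchestrator that only wires stages together. This is a minimal illustration, not a reference implementation: `RagOrchestrator`, `Chunk`, the five injected callables, and the candidate counts (50 retrieved, 5 kept) are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Sequence


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str


@dataclass
class RagOrchestrator:
    # Every collaborator is a hypothetical stand-in for a real service
    # (embedding model, Vector DB, re-ranker, prompt builder, LLM).
    embed: Callable[[str], Sequence[float]]
    search: Callable[[Sequence[float], frozenset[str], int], list[Chunk]]
    rerank: Callable[[str, list[Chunk], int], list[Chunk]]
    build_prompt: Callable[[str, list[Chunk]], str]
    generate: Callable[[str], Iterable[str]]

    def answer(self, question: str, user_groups: frozenset[str]) -> Iterator[str]:
        """Run embed -> permission-filtered search -> re-rank -> prompt -> stream."""
        query_vec = self.embed(question)
        # The user's groups go into the search call itself, so filtering
        # happens inside retrieval rather than afterwards.
        candidates = self.search(query_vec, user_groups, 50)
        top_chunks = self.rerank(question, candidates, 5)
        prompt = self.build_prompt(question, top_chunks)
        # Yield tokens as they arrive so the caller can stream them on.
        yield from self.generate(prompt)
```

Keeping the orchestrator free of model or database details means each stage can be swapped or benchmarked on its own, which also makes the per-stage latency split easy to measure.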

Deep dives

🧩 Prompt Assembly & Grounding

Build the prompt as: system instructions ("answer ONLY from the context; if unknown, say so") + the top-K retrieved chunks (with source IDs) + the user question. Fit it to the context budget — too many chunks dilute attention and raise cost; too few hurt recall. Force the model to cite chunk IDs so claims are traceable. This grounding is what separates RAG from a bare chatbot.
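As a rough sketch of that assembly step, the function below packs re-ranked chunks into the prompt, best first, until a context budget is exhausted. The names `assemble_prompt`, `estimate_tokens`, and `SYSTEM_INSTRUCTIONS` are hypothetical, and the four-characters-per-token estimate is only a placeholder assumption for a real tokenizer.

```python
from dataclasses import dataclass

SYSTEM_INSTRUCTIONS = (
    "Answer ONLY from the context below. If the context does not contain "
    "the answer, say you do not know. Cite chunk IDs in square brackets."
)


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str


def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly 4 characters per token.
    # Replace with the target model's tokenizer in practice.
    return max(1, len(text) // 4)


def assemble_prompt(
    question: str, ranked_chunks: list[Chunk], context_budget: int
) -> tuple[str, list[str]]:
    """Return the prompt plus the chunk IDs that made it into the context."""
    used = 0
    blocks: list[str] = []
    included_ids: list[str] = []
    for chunk in ranked_chunks:  # assumed to be sorted best-first
        block = f"[{chunk.chunk_id}] {chunk.text}"
        cost = estimate_tokens(block)
        if used + cost > context_budget:
            break  # stop at the first chunk that would exceed the budget
        blocks.append(block)
        included_ids.append(chunk.chunk_id)
        used += cost
    context = "\n".join(blocks)
    prompt = f"{SYSTEM_INSTRUCTIONS}\n\nContext:\n{context}\n\nQuestion: {question}"
    return prompt, included_ids
```

Returning the included chunk IDs alongside the prompt lets the caller later check that every ID the model cites was actually present in the context.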

🎯 Retrieval Quality & Re-ranking

First-stage approximate nearest neighbor (ANN) search retrieves ~50–100 candidates cheaply (high recall, loose precision). A cross-encoder re-ranker then scores each (query, chunk) pair jointly and keeps the top ~5–8 for the prompt. This two-stage funnel is typically the biggest lever on answer quality. For vague questions, add query rewriting or HyDE (hypothetical document embeddings: embed an LLM-drafted answer instead of the raw question); for recall, add multi-query expansion.
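The funnel shape can be illustrated with a small, dependency-free sketch. Two substitutions are assumed here and should not be read as the real components: an exact cosine scan stands in for the ANN index, and a word-overlap scorer (`overlap_score`) stands in for the cross-encoder. All function names are hypothetical.

```python
import math
from typing import Callable


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def first_stage(query_vec: list[float], index: dict[str, list[float]], k: int) -> list[str]:
    """Cheap, recall-oriented stage. Exact scan here; an ANN index in production."""
    ranked = sorted(index, key=lambda cid: cosine(query_vec, index[cid]), reverse=True)
    return ranked[:k]


def rerank(
    query: str,
    candidate_ids: list[str],
    texts: dict[str, str],
    score_pair: Callable[[str, str], float],
    keep: int,
) -> list[str]:
    """Precision-oriented stage: score each (query, chunk) pair jointly."""
    ranked = sorted(
        candidate_ids, key=lambda cid: score_pair(query, texts[cid]), reverse=True
    )
    return ranked[:keep]


def overlap_score(query: str, passage: str) -> float:
    # Placeholder for a cross-encoder: fraction of query words found in the passage.
    q_words = set(query.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / len(q_words) if q_words else 0.0
```

The point of the split is cost: the first stage touches every vector but does trivial work per item, while the expensive pairwise scorer only ever sees the short candidate list.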

🔄 Index Freshness (CDC)

Stale context produces confidently wrong answers. Hook source systems via change-data-capture: on update, re-chunk and re-embed only the changed document and upsert (with version + tombstone for deletes). Track per-doc version so retrieval never mixes old and new chunks of the same document.
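A minimal in-memory sketch of versioned upserts with tombstones follows. `ChunkIndex` and its methods are hypothetical, and a real Vector DB would hold embeddings rather than raw text; the part worth noting is the ordering (write the new version, flip the live pointer, then garbage-collect) and the rejection of stale or replayed events.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ChunkIndex:
    # (doc_id, version, position) -> chunk text
    chunks: dict[tuple[str, int, int], str] = field(default_factory=dict)
    live_version: dict[str, int] = field(default_factory=dict)
    tombstones: dict[str, int] = field(default_factory=dict)  # doc_id -> deleted-at version

    def apply_change(self, doc_id: str, version: int, chunk_texts: Optional[list[str]]) -> bool:
        """Apply one CDC event. chunk_texts=None means the document was deleted."""
        latest = max(self.live_version.get(doc_id, -1), self.tombstones.get(doc_id, -1))
        if version <= latest:
            return False  # stale or duplicate event: ignore it
        if chunk_texts is None:
            self.tombstones[doc_id] = version
            self.live_version.pop(doc_id, None)
        else:
            # 1) write the new version's chunks, 2) flip the live pointer
            for pos, text in enumerate(chunk_texts):
                self.chunks[(doc_id, version, pos)] = text
            self.live_version[doc_id] = version
            self.tombstones.pop(doc_id, None)
        # 3) garbage-collect chunks that no longer belong to the live version
        live = self.live_version.get(doc_id)
        for key in [k for k in self.chunks if k[0] == doc_id and k[1] != live]:
            del self.chunks[key]
        return True

    def retrievable(self, doc_id: str) -> list[str]:
        """Only chunks of the live version are visible, so versions never mix."""
        live = self.live_version.get(doc_id)
        if live is None:
            return []
        keys = sorted(k for k in self.chunks if k[0] == doc_id and k[1] == live)
        return [self.chunks[k] for k in keys]
```

Keeping the tombstone's version around is what lets the index reject a late-arriving update for a document that has already been deleted.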

📏 Evaluation & Hallucination Guarding

You cannot improve what you cannot measure. Maintain a golden Q/A set and score faithfulness (is every claim supported by retrieved context?), answer relevance, and context precision/recall — often with an LLM-as-judge plus human spot-checks. Gate releases on these metrics. At serve time, optionally verify the answer against its citations before returning.
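One way to picture the release gate is the sketch below. It assumes claims have already been extracted from each answer and that `is_supported` is some judge callable (an LLM-as-judge in practice, any function here). The metric definitions are simple set-based versions chosen for illustration, answer relevance is left out for brevity, and `EvalCase`, `release_gate`, and the threshold values are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    question: str
    answer_claims: list[str]      # claims extracted from the generated answer
    context: str                  # concatenated retrieved chunks
    retrieved_ids: list[str]
    relevant_ids: frozenset[str]  # gold labels from the golden set


def faithfulness(case: EvalCase, is_supported: Callable[[str, str], bool]) -> float:
    """Share of answer claims that the judge finds supported by the context."""
    if not case.answer_claims:
        return 1.0
    supported = sum(1 for c in case.answer_claims if is_supported(c, case.context))
    return supported / len(case.answer_claims)


def context_precision(case: EvalCase) -> float:
    if not case.retrieved_ids:
        return 0.0
    hits = sum(1 for cid in case.retrieved_ids if cid in case.relevant_ids)
    return hits / len(case.retrieved_ids)


def context_recall(case: EvalCase) -> float:
    if not case.relevant_ids:
        return 1.0
    return len(case.relevant_ids & set(case.retrieved_ids)) / len(case.relevant_ids)


def release_gate(
    cases: list[EvalCase],
    is_supported: Callable[[str, str], bool],
    thresholds: dict[str, float],
) -> tuple[bool, dict[str, float]]:
    """Average each metric over a non-empty golden set and compare to its floor."""
    scores = {
        "faithfulness": mean(faithfulness(c, is_supported) for c in cases),
        "context_precision": mean(context_precision(c) for c in cases),
        "context_recall": mean(context_recall(c) for c in cases),
    }
    passed = all(scores[name] >= floor for name, floor in thresholds.items())
    return passed, scores
```

Running this in CI against the golden set turns "did quality regress?" into a pass/fail signal, with human spot-checks reserved for auditing the judge itself.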

🔐 Permission-Aware Retrieval

Filter candidates by the requesting user's ACL groups DURING retrieval, never retrieve-then-filter. Post-filtering can starve results (if most of the top-K hits are forbidden, few or none survive) and can leak the existence of restricted documents. Store aclGroups on each chunk's metadata and pass them as a hard pre-filter to the Vector DB so a user can only be grounded in documents they may see.
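The starvation problem is easy to see in a toy comparison. In this sketch each chunk carries a precomputed similarity `score` (an assumption made to keep the example short), `acl_groups` mirrors the `aclGroups` metadata field, and both search functions are hypothetical stand-ins for a Vector DB query with and without a metadata filter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedChunk:
    chunk_id: str
    score: float                # precomputed similarity to the query (assumed)
    acl_groups: frozenset[str]  # groups allowed to see this chunk


def search_prefiltered(
    chunks: list[IndexedChunk], user_groups: frozenset[str], k: int
) -> list[IndexedChunk]:
    """Restrict to visible chunks first, then take the top k."""
    visible = [c for c in chunks if c.acl_groups & user_groups]
    return sorted(visible, key=lambda c: c.score, reverse=True)[:k]


def search_postfiltered(
    chunks: list[IndexedChunk], user_groups: frozenset[str], k: int
) -> list[IndexedChunk]:
    """Anti-pattern: take the top k globally, then drop forbidden chunks."""
    top = sorted(chunks, key=lambda c: c.score, reverse=True)[:k]
    return [c for c in top if c.acl_groups & user_groups]
```

With a corpus whose best-scoring chunks belong to another group, the pre-filtered search still returns k permitted results, while the post-filtered one can come back empty even though relevant, permitted chunks exist.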

Scaling considerations

What interviewers expect by level

