📚 Design RAG System — System Design Interview Guide
Hard · AI & ML Systems
Design a Retrieval-Augmented Generation (RAG) system: ground an LLM's answers in your own corpus so that it cites real sources and hallucinates less, for example as an enterprise knowledge assistant.
Functional requirements
- Answer questions using retrieved context from a private corpus
- Cite the source chunks used to generate each answer
- Keep the index fresh as source documents change
- Respect per-user document permissions at retrieval time
- Evaluate answer quality offline (faithfulness, relevance)
Non-functional requirements & scale
- 50M document chunks across 2M source documents
- End-to-end time to first token (TTFT) p95 < 2s (embed + retrieve + re-rank + start of generation)
- Answer faithfulness (grounded-in-context) > 95%
- Index freshness < 5 min after a document update
- No data leakage across permission boundaries
Capacity estimation
RAG = retrieval (the Semantic Search problem) + generation (the Chatbot problem) wired together with careful prompt assembly and evaluation. Latency budget splits across embed → ANN search → re-rank → LLM generate. The defining challenges are grounding/faithfulness, permissions, and keeping the index in sync with the source of truth.
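As a rough illustration of how the scale figures translate into sizing, here is a back-of-envelope sketch. Only the 50M chunks, 2M documents, and the 2 s TTFT target come from the requirements above; the embedding width, average chunk size, and the per-stage latency split are assumptions chosen purely for the arithmetic.

```python
# Back-of-envelope sizing for the vector index and the latency budget.
NUM_CHUNKS = 50_000_000        # from the requirements
NUM_DOCS = 2_000_000           # from the requirements
TTFT_TARGET_MS = 2_000         # from the requirements

EMBED_DIM = 768                # ASSUMPTION: embedding width
BYTES_PER_FLOAT = 4            # float32 storage
AVG_CHUNK_TEXT_BYTES = 1_000   # ASSUMPTION: ~1 KB of text per chunk

chunks_per_doc = NUM_CHUNKS / NUM_DOCS
vector_bytes = NUM_CHUNKS * EMBED_DIM * BYTES_PER_FLOAT
text_bytes = NUM_CHUNKS * AVG_CHUNK_TEXT_BYTES

# ASSUMPTION: one possible split of the TTFT budget, in milliseconds.
latency_budget_ms = {
    "embed_query": 50,
    "ann_search": 100,
    "rerank": 250,
    "prompt_assembly": 50,
    "llm_first_token": 1_200,
}
headroom_ms = TTFT_TARGET_MS - sum(latency_budget_ms.values())

print(f"chunks per document: {chunks_per_doc:.0f}")
print(f"raw vectors: {vector_bytes / 1e9:.1f} GB")
print(f"chunk text:  {text_bytes / 1e9:.1f} GB")
print(f"TTFT headroom: {headroom_ms} ms")
```

Under these assumptions the raw vectors alone are about 154 GB before any ANN graph overhead or replication, which is why the index is usually sharded, quantized, or both.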
Core entities
- Document — docId, source, version, updatedAt, aclGroups[]
- Chunk — chunkId, docId, text, embedding, metadata
- Query — queryText, userId, retrievedChunkIds[], filters
- Answer — answerId, query, generatedText, citations[], faithfulnessScore
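The entities above could be modelled roughly as follows. This is a minimal sketch: the field names mirror the list, but the types, defaults, and the choice to make `faithfulness_score` optional (because the eval job is async) are assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Document:
    doc_id: str
    source: str
    version: int
    updated_at: datetime
    acl_groups: list = field(default_factory=list)


@dataclass
class Chunk:
    chunk_id: str
    doc_id: str
    text: str
    embedding: list                               # vector of floats
    metadata: dict = field(default_factory=dict)  # e.g. a copy of acl_groups


@dataclass
class Query:
    query_text: str
    user_id: str
    retrieved_chunk_ids: list = field(default_factory=list)
    filters: dict = field(default_factory=dict)


@dataclass
class Answer:
    answer_id: str
    query: Query
    generated_text: str
    citations: list = field(default_factory=list)  # chunk IDs backing the answer
    faithfulness_score: Optional[float] = None     # filled in later by the eval job


doc = Document("d1", "wiki", 1, datetime.now(timezone.utc), ["eng"])
chunk = Chunk("d1:0", doc.doc_id, "Reset via the portal.", [0.1, 0.2],
              {"acl_groups": doc.acl_groups})
query = Query("How do I reset my password?", "u42", [chunk.chunk_id])
answer = Answer("a1", query, "Use the portal [d1:0].", [chunk.chunk_id])
```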
API design
- POST /api/v1/ask: Body: { query, conversationId? }. Returns a grounded, streamed answer with citations.
- POST /api/v1/ingest: Ingest/refresh a source document into the index (async).
- GET /api/v1/answers/:id/citations: Return the source chunks backing an answer.
High-level design
A change-data-capture (CDC) ingest pipeline chunks and embeds source documents into the Vector DB. When a question arrives, the RAG Orchestrator embeds the query, retrieves candidate chunks (filtered by the user's permissions), re-ranks them, assembles a grounded prompt (instructions + context + question), and calls the LLM with streaming. Citations map generated claims back to the retrieved chunks. An async eval job scores faithfulness.
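The query path described above can be sketched as a small orchestrator function. Everything here is hypothetical: `ask`, its parameter names, and the stub `embed`, `search`, `rerank`, and `generate` callables are stand-ins for the real embedding model, Vector DB client, re-ranker, and LLM client.

```python
from typing import Callable, Iterator


def ask(question: str, user_groups: set, *,
        embed: Callable, search: Callable, rerank: Callable,
        generate: Callable, top_n: int = 5):
    """Embed, retrieve (permission-filtered), re-rank, assemble, then stream."""
    query_vec = embed(question)
    candidates = search(query_vec, user_groups)   # ACL filter is applied inside search
    context = rerank(question, candidates)[:top_n]
    context_text = "\n".join(f"[{c['id']}] {c['text']}" for c in context)
    prompt = (
        "Answer ONLY from the context and cite chunk IDs.\n\n"
        f"Context:\n{context_text}\n\nQuestion: {question}"
    )
    citations = [c["id"] for c in context]
    return citations, generate(prompt)            # generate returns a token iterator


# Stubs so the sketch runs end to end; a real system would call services here.
def stub_embed(text: str) -> list:
    return [float(len(text))]


def stub_search(vec: list, groups: set) -> list:
    return [
        {"id": "c2", "text": "Support hours are 9 to 5."},
        {"id": "c1", "text": "Refunds take 5 days."},
    ]


def stub_rerank(question: str, candidates: list) -> list:
    return sorted(candidates, key=lambda c: "refund" in c["text"].lower(), reverse=True)


def stub_generate(prompt: str) -> Iterator[str]:
    return iter(["Refunds ", "take 5 days ", "[c1]."])


citations, stream = ask(
    "How long do refunds take?", {"support"},
    embed=stub_embed, search=stub_search, rerank=stub_rerank,
    generate=stub_generate, top_n=1,
)
answer_text = "".join(stream)
```

Returning the citations alongside the token stream lets the API send them as soon as retrieval finishes, independently of generation.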
Deep dives
🧩 Prompt Assembly & Grounding
Build the prompt as: system instructions ("answer ONLY from the context; if unknown, say so") + the top-K retrieved chunks (with source IDs) + the user question. Fit it to the context budget — too many chunks dilute attention and raise cost; too few hurt recall. Force the model to cite chunk IDs so claims are traceable. This grounding is what separates RAG from a bare chatbot.
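A minimal sketch of budget-aware prompt assembly follows. The names `RetrievedChunk`, `estimate_tokens`, and `build_prompt` are illustrative, and the four-characters-per-token estimate is a crude assumption standing in for the model's real tokenizer.

```python
from dataclasses import dataclass


@dataclass
class RetrievedChunk:
    chunk_id: str
    text: str


SYSTEM_INSTRUCTIONS = (
    "Answer ONLY from the context below and cite the chunk IDs you used in "
    "square brackets. If the context does not contain the answer, say so."
)


def estimate_tokens(text: str) -> int:
    # ASSUMPTION: roughly 4 characters per token; swap in a real tokenizer.
    return max(1, len(text) // 4)


def build_prompt(question: str, chunks: list, context_budget_tokens: int):
    """Pack chunks best-first until the context budget is spent."""
    parts, used_ids, spent = [], [], 0
    for chunk in chunks:                     # assumed sorted best-first by the re-ranker
        block = f"[{chunk.chunk_id}] {chunk.text}"
        cost = estimate_tokens(block)
        if spent + cost > context_budget_tokens:
            break                            # stop before overflowing the budget
        parts.append(block)
        used_ids.append(chunk.chunk_id)
        spent += cost
    context = "\n".join(parts)
    prompt = f"{SYSTEM_INSTRUCTIONS}\n\nContext:\n{context}\n\nQuestion: {question}"
    return prompt, used_ids


ranked = [
    RetrievedChunk("c1", "a" * 40),
    RetrievedChunk("c2", "b" * 40),
    RetrievedChunk("c3", "c" * 40),
]
prompt, used_ids = build_prompt("What is the policy?", ranked, context_budget_tokens=25)
```

Each block here costs 11 estimated tokens, so a budget of 25 admits the first two chunks and drops the third. Returning `used_ids` gives the citation step the exact set of chunks the model was allowed to see.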
🎯 Retrieval Quality & Re-ranking
First-stage ANN retrieval returns ~50–100 candidates cheaply (high recall, loose precision). A cross-encoder re-ranker then scores each (query, chunk) pair jointly and keeps the top ~5–8 for the prompt. This two-stage funnel is usually one of the biggest levers on answer quality. For vague questions, add query rewriting or HyDE (hypothetical document embeddings: embed an LLM-drafted answer instead of the raw question); for recall, add multi-query expansion.
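The two-stage funnel can be illustrated with a toy example. Note the substitutions: brute-force cosine search stands in for a real ANN index, and a word-overlap score stands in for a cross-encoder. Both are assumptions made so the sketch runs without any model or library.

```python
import math


def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def first_stage(query_vec: list, index: dict, k: int) -> list:
    """Stage 1: cheap vector similarity (stand-in for ANN), wide net."""
    ranked = sorted(index, key=lambda cid: cosine(query_vec, index[cid]), reverse=True)
    return ranked[:k]


def overlap_score(query: str, passage: str) -> float:
    """Toy stand-in for a cross-encoder: share of query words found in the passage."""
    q_words = set(query.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / len(q_words) if q_words else 0.0


def rerank(query: str, candidates: list, texts: dict, top_n: int) -> list:
    """Stage 2: score each (query, chunk) pair jointly and keep the best few."""
    ranked = sorted(candidates, key=lambda cid: overlap_score(query, texts[cid]), reverse=True)
    return ranked[:top_n]


index = {"c1": [1.0, 0.0], "c2": [0.9, 0.1], "c3": [0.8, 0.6], "c4": [0.0, 1.0]}
texts = {
    "c1": "vacation policy overview",
    "c2": "how to reset your password in the portal",
    "c3": "password rotation schedule",
    "c4": "office floor plan",
}
candidates = first_stage([1.0, 0.0], index, k=3)
final = rerank("reset password", candidates, texts, top_n=2)
```

In this data the vector stage ranks `c1` first, but the pairwise scorer promotes `c2`. That reordering is the reason for paying the extra latency of the second stage.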
🔄 Index Freshness (CDC)
Stale context produces confidently wrong answers. Hook source systems via change-data-capture: on update, re-chunk and re-embed only the changed document and upsert (with version + tombstone for deletes). Track per-doc version so retrieval never mixes old and new chunks of the same document.
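The version and tombstone bookkeeping described above might look like the in-memory sketch below. `ChunkIndex` and its methods are hypothetical, and it assumes every CDC event (update or delete) carries a per-document version that only increases.

```python
from dataclasses import dataclass


@dataclass
class IndexedChunk:
    chunk_id: str
    doc_id: str
    version: int
    text: str


class ChunkIndex:
    def __init__(self) -> None:
        self.chunks = {}        # chunk_id -> IndexedChunk
        self.live_version = {}  # doc_id -> version currently served
        self.tombstones = {}    # doc_id -> version at which it was deleted

    def _latest_seen(self, doc_id: str) -> int:
        return max(self.live_version.get(doc_id, -1), self.tombstones.get(doc_id, -1))

    def _drop_chunks(self, doc_id: str, keep_version=None) -> None:
        stale = [cid for cid, c in self.chunks.items()
                 if c.doc_id == doc_id and c.version != keep_version]
        for cid in stale:
            del self.chunks[cid]

    def apply_update(self, doc_id: str, version: int, chunk_texts: list) -> bool:
        if version <= self._latest_seen(doc_id):
            return False                       # stale or replayed event: ignore
        for i, text in enumerate(chunk_texts):
            cid = f"{doc_id}:v{version}:{i}"
            self.chunks[cid] = IndexedChunk(cid, doc_id, version, text)
        self.live_version[doc_id] = version    # flip only after new chunks are written
        self.tombstones.pop(doc_id, None)
        self._drop_chunks(doc_id, keep_version=version)
        return True

    def apply_delete(self, doc_id: str, version: int) -> bool:
        if version <= self._latest_seen(doc_id):
            return False
        self.tombstones[doc_id] = version      # remember the delete so late updates lose
        self.live_version.pop(doc_id, None)
        self._drop_chunks(doc_id)
        return True

    def live_texts(self, doc_id: str) -> list:
        live = self.live_version.get(doc_id)
        return [c.text for c in self.chunks.values()
                if c.doc_id == doc_id and c.version == live]


index = ChunkIndex()
index.apply_update("d1", 1, ["old intro", "old body"])
index.apply_update("d1", 2, ["new intro"])
replay_applied = index.apply_update("d1", 1, ["old intro"])  # out-of-order replay
```

Because reads only return chunks whose version matches `live_version`, a query never sees a mix of old and new chunks for the same document, and the tombstone stops a delayed update from resurrecting a deleted one.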
📏 Evaluation & Hallucination Guarding
You cannot improve what you cannot measure. Maintain a golden Q/A set and score faithfulness (is every claim supported by retrieved context?), answer relevance, and context precision/recall — often with an LLM-as-judge plus human spot-checks. Gate releases on these metrics. At serve time, optionally verify the answer against its citations before returning.
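A release gate on faithfulness can be sketched as below. The judge is passed in as a callable so an LLM-as-judge could be plugged in; the `substring_judge` used here is only a toy stand-in, and the 0.95 threshold is taken from the faithfulness target in the requirements.

```python
from typing import Callable

Judge = Callable[[str, str], bool]  # (claim, context) -> is the claim supported?


def faithfulness(claims: list, context: str, judge: Judge) -> float:
    """Share of an answer's claims that the retrieved context supports."""
    if not claims:
        return 1.0  # nothing asserted, so nothing unsupported
    return sum(1 for claim in claims if judge(claim, context)) / len(claims)


def release_gate(examples: list, judge: Judge, threshold: float = 0.95):
    """Return (passed, mean_faithfulness) over a golden set."""
    if not examples:
        raise ValueError("golden set is empty")
    scores = [faithfulness(ex["claims"], ex["context"], judge) for ex in examples]
    mean_score = sum(scores) / len(scores)
    return mean_score >= threshold, mean_score


def substring_judge(claim: str, context: str) -> bool:
    # Toy stand-in for an LLM judge: exact, case-insensitive containment.
    return claim.lower() in context.lower()


golden_set = [
    {"claims": ["refunds take 5 days"],
     "context": "Refunds take 5 days to process."},
    {"claims": ["support is open 24/7", "chat is free"],
     "context": "Chat is free for all plans."},
]
passed, mean_score = release_gate(golden_set, substring_judge)
```

Here the second answer has one unsupported claim, so the mean is 0.75 and the release is blocked.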
🔐 Permission-Aware Retrieval
Filter candidates by the requesting user's ACL groups DURING retrieval — never retrieve-then-filter, which can starve results and leak existence. Store aclGroups on each chunk's metadata and pass them as a hard pre-filter to the Vector DB so a user can only be grounded in documents they may see.
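The starvation problem is easy to demonstrate with a toy example. In the sketch below a similarity-sorted list stands in for the Vector DB, and the function names are illustrative. In a real deployment the filter would be pushed into the ANN search itself rather than applied in application code.

```python
def allowed(chunk: dict, user_groups: set) -> bool:
    """A chunk is visible if it shares at least one ACL group with the user."""
    return bool(set(chunk["acl_groups"]) & user_groups)


def post_filter_search(ranked_chunks: list, user_groups: set, k: int) -> list:
    # Anti-pattern: take the top-k first, then drop what the user may not see.
    return [c for c in ranked_chunks[:k] if allowed(c, user_groups)]


def pre_filter_search(ranked_chunks: list, user_groups: set, k: int) -> list:
    # Restrict the candidate set to visible chunks, then take the top-k.
    visible = [c for c in ranked_chunks if allowed(c, user_groups)]
    return visible[:k]


# Chunks already sorted by similarity to the query, best first.
ranked_chunks = [
    {"id": "c1", "acl_groups": ["hr"]},
    {"id": "c2", "acl_groups": ["hr"]},
    {"id": "c3", "acl_groups": ["eng"]},
    {"id": "c4", "acl_groups": ["hr"]},
    {"id": "c5", "acl_groups": ["eng"]},
]
starved = [c["id"] for c in post_filter_search(ranked_chunks, {"eng"}, k=2)]
grounded = [c["id"] for c in pre_filter_search(ranked_chunks, {"eng"}, k=2)]
```

With `k=2`, the post-filter path returns nothing for an engineering user because both top hits belong to HR, while the pre-filter path still returns two usable chunks.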
Scaling considerations
- Decompose into Retrieval, Re-rank, and Generation services that scale independently
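- Cache (query + permission scope → retrieved chunks) and reuse across similar questions; a cache keyed on the query alone would leak chunks across permission boundaries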
- Re-ranker and embedder are GPU-bound — batch and autoscale them separately
- Version indexes so a model/embedding upgrade is a safe blue-green rebuild
- Stream the final answer; retrieval + re-rank happen before first token
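One way to keep the retrieval cache from the list above compatible with the no-leakage requirement is to include the user's permission scope and the index version in the cache key. The sketch below is an assumption-laden illustration: `RetrievalCache` is hypothetical, and it matches on a normalized exact query rather than on semantic similarity.

```python
class RetrievalCache:
    """Maps (normalized query, ACL groups, index version) to retrieved chunk IDs."""

    def __init__(self) -> None:
        self._store = {}

    @staticmethod
    def _key(query: str, user_groups: set, index_version: int) -> tuple:
        normalized = " ".join(query.lower().split())  # fold case and whitespace
        return (normalized, frozenset(user_groups), index_version)

    def get(self, query: str, user_groups: set, index_version: int):
        return self._store.get(self._key(query, user_groups, index_version))

    def put(self, query: str, user_groups: set, index_version: int, chunk_ids: list) -> None:
        self._store[self._key(query, user_groups, index_version)] = list(chunk_ids)


cache = RetrievalCache()
cache.put("How do I reset my password?", {"eng"}, 1, ["c2", "c3"])
same_scope_hit = cache.get("how do i  reset my PASSWORD?", {"eng"}, 1)
other_scope_hit = cache.get("How do I reset my password?", {"hr"}, 1)
after_rebuild_hit = cache.get("How do I reset my password?", {"eng"}, 2)
```

Keying on the index version means a blue-green rebuild naturally misses the old entries instead of serving chunk IDs from the previous index.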
What interviewers expect by level
- Junior: Explain why RAG reduces hallucination and the retrieve→augment→generate flow.
- Mid: Design chunking + embedding + ANN retrieval, prompt assembly with citations, basic freshness.
- Senior: Two-stage retrieval with cross-encoder re-ranking, permission-aware pre-filtering, CDC index updates, caching.
- Staff: Faithfulness evaluation harness + release gating, index versioning/blue-green re-embedding, multi-service capacity planning, cost/quality trade-offs.