🤖 Design AI Chatbot — System Design Interview Guide

Medium · AI & ML Systems

Design a ChatGPT-style conversational AI service: users send messages, the system streams back model-generated responses while preserving multi-turn conversation context.

Open the interactive AI Chatbot design on PrepGrind → Drag load balancers, caches, databases, and queues onto a canvas, run a live traffic simulation to watch latency and bottlenecks under load, and follow the full interview walkthrough below — free, in your browser.

Functional requirements

Non-functional requirements & scale

Capacity estimation

At ~30 msgs/sec average (5–10× peak), each completion ~500 output tokens. Cost is dominated by tokens, so context trimming matters. LLM calls are slow (seconds) and capacity-constrained — the architecture is about streaming, queueing, and context management, not raw QPS.

Core entities

API design

High-level design

Client opens an SSE/WebSocket connection through the gateway to the Chat Service. The Chat Service loads recent history from the message store, assembles the prompt within the context budget, calls the LLM with streaming enabled, relays tokens to the client, and asynchronously persists the completed assistant message. Redis caches hot conversation context; a moderation check gates both input and output.

Deep dives

⚡ Token Streaming

Generation is autoregressive — tokens emerge one at a time. Hold an SSE (or WebSocket) connection from client → gateway → chat service and forward each token as it arrives from the model. Perceived latency becomes TTFT, not full-completion time. Critical: gateways/CDNs must NOT buffer the response (disable proxy buffering), and clients should send a "resume from token N" cursor so a dropped mobile connection can reconnect mid-answer.

🧠 Context Window Management

The model only sees what fits in its window (e.g. 128K tokens). For long chats: keep a rolling summary of old turns + the last N verbatim messages + the system prompt. Budget = context_window − max_output − system_prompt. Cache the assembled context per conversation in Redis to avoid re-reading the full history on every turn.

🛡️ Moderation & Safety

Screen BOTH input (block prompt-injection / disallowed requests) and output (filter unsafe generations) with a fast classifier before tokens reach the user. Since output is streamed, moderate in a sliding window and be ready to halt the stream. Log flagged events for review.

💸 Cost & Model Routing

Tokens are the bill. Route simple/short queries to a small fast model and only escalate to a frontier model when needed (length, complexity, or explicit user choice). Enforce per-user monthly token budgets. Cache identical system-prompt prefixes (prompt caching) to cut input cost.

Scaling considerations

What interviewers expect by level

Practice more system design case studies

PrepGrind runs entirely in your browser, free, no installation required. Loading the interactive playground…