🤖 Design AI Chatbot — System Design Interview Guide

Medium · AI & ML Systems

Design a ChatGPT-style conversational AI service: users send messages, the system streams back model-generated responses while preserving multi-turn conversation context.

Functional requirements

Multi-turn chat: responses use prior conversation history
Stream tokens back to the client as they are generated
Persist conversations so users can resume them later
Per-user rate limiting and abuse / content moderation
Support multiple models (fast/cheap vs smart/expensive)

Non-functional requirements & scale

500K daily active users, ~5 messages each (~2.5M msgs/day)
Time-to-first-token (TTFT) p95 < 800ms
Streaming must survive flaky mobile networks (resumable)
Conversations retained for 1 year; encrypted at rest
Graceful degradation when the model provider is rate-limited

Capacity estimation

2.5M msgs/day works out to about 29 msgs/sec on average; plan for peaks of 5–10× that (roughly 145–290 msgs/sec). At ~500 output tokens per completion, that is about 1.25B output tokens per day. Cost is dominated by tokens, so context trimming matters. LLM calls are slow (seconds) and capacity-constrained, so the architecture is about streaming, queueing, and context management, not raw QPS.
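
The arithmetic behind these figures, as a back-of-envelope sketch. All inputs come from the requirements above; the variable names are illustrative only.

```python
# Back-of-envelope capacity estimate from the stated requirements.
DAU = 500_000
MSGS_PER_USER = 5
OUTPUT_TOKENS_PER_MSG = 500
SECONDS_PER_DAY = 86_400

msgs_per_day = DAU * MSGS_PER_USER                      # 2.5M messages/day
avg_msgs_per_sec = msgs_per_day / SECONDS_PER_DAY       # about 29 msgs/sec
peak_msgs_per_sec = (avg_msgs_per_sec * 5, avg_msgs_per_sec * 10)

output_tokens_per_day = msgs_per_day * OUTPUT_TOKENS_PER_MSG
avg_output_tokens_per_sec = output_tokens_per_day / SECONDS_PER_DAY

print(f"avg msgs/sec:       {avg_msgs_per_sec:.1f}")
print(f"peak msgs/sec:      {peak_msgs_per_sec[0]:.0f} to {peak_msgs_per_sec[1]:.0f}")
print(f"output tokens/day:  {output_tokens_per_day:,}")
print(f"output tokens/sec:  {avg_output_tokens_per_sec:,.0f}")
```

The message rate is small by web standards; the token throughput is what the model capacity has to absorb.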

Core entities

Conversation — conversationId (PK), userId, title, model, createdAt, updatedAt
Message — messageId, conversationId, role (user/assistant/system), content, tokenCount, createdAt
User — userId, plan (free/pro), rateLimitRemaining, monthlyTokenBudget
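
One way to write these entities down, as a minimal sketch using Python dataclasses. The field names mirror the list above; the ID format (random hex) and UTC timestamps are assumptions, not part of the design.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Literal

Role = Literal["user", "assistant", "system"]
Plan = Literal["free", "pro"]


def _new_id() -> str:
    # Assumption: opaque random IDs; a time-ordered ID would also work.
    return uuid.uuid4().hex


def _now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class User:
    user_id: str
    plan: Plan
    rate_limit_remaining: int
    monthly_token_budget: int


@dataclass
class Conversation:
    user_id: str
    model: str
    title: str = ""
    conversation_id: str = field(default_factory=_new_id)
    created_at: datetime = field(default_factory=_now)
    updated_at: datetime = field(default_factory=_now)


@dataclass
class Message:
    conversation_id: str
    role: Role
    content: str
    token_count: int
    message_id: str = field(default_factory=_new_id)
    created_at: datetime = field(default_factory=_now)
```

Storing token_count on each message lets the context assembler sum history cost without re-tokenizing.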

API design

POST /api/v1/conversations — Create a conversation. Returns { conversationId }.
POST /api/v1/conversations/:id/messages — Send a message. Returns an SSE stream of assistant tokens.
GET /api/v1/conversations/:id — Fetch full message history (paginated).
GET /api/v1/conversations — List the user's conversations.
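
For the streaming endpoint, each token can be framed as a Server-Sent Event. The sketch below shows the wire format only; the event names "token" and "done" and the JSON payload shape are assumptions for illustration, not something the API above specifies.

```python
import json
from typing import Iterable, Iterator


def sse_event(event_id: int, event: str, payload: dict) -> str:
    """Format one SSE frame. JSON encoding keeps the data on a single line."""
    data = json.dumps(payload)
    return f"id: {event_id}\nevent: {event}\ndata: {data}\n\n"


def stream_frames(tokens: Iterable[str]) -> Iterator[str]:
    """Yield one frame per token, then a final frame marking completion."""
    count = 0
    for count, token in enumerate(tokens, start=1):
        yield sse_event(count, "token", {"text": token})
    yield sse_event(count + 1, "done", {"tokens": count})
```

Giving every frame an increasing id means the client always knows the last token index it received.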

High-level design

Client opens an SSE/WebSocket connection through the gateway to the Chat Service. The Chat Service loads recent history from the message store, assembles the prompt within the context budget, calls the LLM with streaming enabled, relays tokens to the client, and asynchronously persists the completed assistant message. Redis caches hot conversation context; a moderation check gates both input and output.
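
The relay-then-persist flow can be sketched with asyncio. This is a simplified illustration: fake_llm stands in for a real streaming model call, send and persist are hypothetical callbacks, and moderation, caching, and context trimming are left out.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, List, Tuple

Turn = Tuple[str, str]  # (role, content)


async def fake_llm(prompt: List[Turn]) -> AsyncIterator[str]:
    """Stand-in for a streaming model call; yields a fixed reply token by token."""
    for token in ["Hello", ",", " world"]:
        await asyncio.sleep(0)  # simulate waiting on the provider
        yield token


async def handle_message(
    history: List[Turn],
    user_text: str,
    llm: Callable[[List[Turn]], AsyncIterator[str]],
    send: Callable[[str], Awaitable[None]],
    persist: Callable[[str, str], Awaitable[None]],
):
    """Relay tokens to the client as they arrive, then persist the full reply."""
    prompt = history + [("user", user_text)]
    parts: List[str] = []
    async for token in llm(prompt):
        await send(token)      # forward immediately, do not wait for completion
        parts.append(token)
    reply = "".join(parts)
    # Persistence runs in the background so it never delays the stream.
    persist_task = asyncio.create_task(persist("assistant", reply))
    return reply, persist_task
```

The caller should keep a reference to persist_task and await or monitor it, so a failed write is noticed rather than silently dropped.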

Deep dives

⚡ Token Streaming

Generation is autoregressive: tokens emerge one at a time. Hold an SSE (or WebSocket) connection from client → gateway → chat service and forward each token as it arrives from the model. Perceived latency becomes TTFT, not full-completion time. Critical: gateways/CDNs must NOT buffer the response (disable proxy buffering). To survive a dropped mobile connection, the server keeps the tokens generated so far until the response completes, and the client reconnects with a "resume from token N" cursor so the stream can continue mid-answer.
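
A minimal sketch of the resume cursor, assuming the server buffers the tokens of each in-flight response. StreamBuffer is a hypothetical in-memory class; a shared store such as Redis would be needed for a reconnect to be served by a different instance.

```python
from typing import List, Tuple


class StreamBuffer:
    """Holds the tokens generated so far for one in-flight response."""

    def __init__(self) -> None:
        self.tokens: List[str] = []
        self.done = False

    def append(self, token: str) -> None:
        self.tokens.append(token)

    def finish(self) -> None:
        self.done = True

    def replay_from(self, cursor: int) -> List[Tuple[int, str]]:
        """Return (index, token) pairs the client has not yet received.

        cursor is the number of tokens the client already has, so a fresh
        client passes 0 and a client that saw tokens 0..N-1 passes N.
        """
        if cursor < 0:
            raise ValueError("cursor must be non-negative")
        return list(enumerate(self.tokens[cursor:], start=cursor))
```

On reconnect the server first sends replay_from(N), then continues with live tokens, so the client sees neither gaps nor duplicates.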

🧠 Context Window Management

The model only sees what fits in its window (e.g. 128K tokens). For long chats: keep a rolling summary of old turns + the last N verbatim messages + the system prompt. Budget = context_window − max_output − system_prompt. Cache the assembled context per conversation in Redis to avoid re-reading the full history on every turn.
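
The budget formula and the "summary plus last N messages" rule can be sketched as follows. The Turn type and function names are illustrative, and token counts are assumed to be precomputed (for example, the stored tokenCount).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Turn:
    role: str
    content: str
    tokens: int


def history_budget(context_window: int, max_output: int, system_prompt_tokens: int) -> int:
    """Tokens left for conversation history after reserving output and system prompt."""
    return context_window - max_output - system_prompt_tokens


def select_context(turns: List[Turn], budget: int, summary: Optional[Turn] = None) -> List[Turn]:
    """Keep the rolling summary plus as many of the newest turns as fit in budget."""
    remaining = budget - (summary.tokens if summary else 0)
    if remaining < 0:
        raise ValueError("summary alone exceeds the history budget")
    kept: List[Turn] = []
    for turn in reversed(turns):      # walk from newest to oldest
        if turn.tokens > remaining:
            break                     # stop at the first turn that does not fit
        kept.append(turn)
        remaining -= turn.tokens
    kept.reverse()                    # restore chronological order
    return ([summary] if summary else []) + kept
```

For example, with a 128K window, 4K reserved for output, and a 1K system prompt, history_budget returns 123,000 tokens for the summary and verbatim turns.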

🛡️ Moderation & Safety

Screen BOTH input (block prompt-injection / disallowed requests) and output (filter unsafe generations) with a fast classifier before tokens reach the user. Since output is streamed, moderate in a sliding window and be ready to halt the stream. Log flagged events for review.
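
Sliding-window output moderation might look like the sketch below. The is_unsafe callback is a placeholder for a real classifier, and holding back a few tokens before release is one possible design (an assumption here) that lets the stream halt before a flagged phrase is fully delivered.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Iterator


def moderated_stream(
    tokens: Iterable[str],
    is_unsafe: Callable[[str], bool],
    window_chars: int = 200,
    hold_back: int = 2,
) -> Iterator[str]:
    """Yield tokens to the user, lagging by hold_back tokens.

    Each new token extends a sliding text window that is re-checked. If the
    window is flagged, the stream stops and held-back tokens are never sent.
    """
    pending: Deque[str] = deque()
    recent = ""
    for token in tokens:
        pending.append(token)
        recent = (recent + token)[-window_chars:]
        if is_unsafe(recent):
            return                     # halt: withheld tokens are dropped
        while len(pending) > hold_back:
            yield pending.popleft()
    yield from pending                 # generation ended cleanly, flush the rest
```

A larger hold_back catches phrases split across more tokens at the cost of extra perceived latency.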

💸 Cost & Model Routing

Tokens are the bill. Route simple/short queries to a small fast model and only escalate to a frontier model when needed (length, complexity, or explicit user choice). Enforce per-user monthly token budgets. Cache identical system-prompt prefixes (prompt caching) to cut input cost.
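
A toy routing rule covering the three escalation signals and the budget check. The model names, the 2,000-token threshold, and using prompt length as a proxy for complexity are all assumptions for illustration.

```python
from typing import Optional

SMALL_MODEL = "small-fast"      # hypothetical model identifiers
FRONTIER_MODEL = "frontier"


class BudgetExceeded(Exception):
    """Raised when the user has used up the monthly token budget."""


def choose_model(
    prompt_tokens: int,
    tokens_used_this_month: int,
    monthly_budget: int,
    requested: Optional[str] = None,
    long_prompt_threshold: int = 2_000,
) -> str:
    # Enforce the budget before spending anything on this request.
    if tokens_used_this_month >= monthly_budget:
        raise BudgetExceeded("monthly token budget exhausted")
    # An explicit user choice wins over the heuristic.
    if requested is not None:
        return requested
    # Crude complexity proxy: long prompts go to the larger model.
    if prompt_tokens > long_prompt_threshold:
        return FRONTIER_MODEL
    return SMALL_MODEL
```

A real router would likely also consider the user's plan and could use a small classifier instead of a length threshold.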

Scaling considerations

Chat service is I/O-bound (waiting on the model) — use async workers, high connection concurrency
Queue requests when the model provider is saturated; apply backpressure + retries with jitter
Cassandra message store partitioned by conversationId and clustered by createdAt, so a history read is a single-partition range scan
Sticky routing so a client that reconnects mid-stream lands on the chat-service instance holding its in-flight response
Fallback to a secondary model/provider on 429/5xx from the primary
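
Retries with jitter and provider fallback can be combined as in this sketch. ProviderError and the provider callables are hypothetical stand-ins for a real client library; the backoff uses "full jitter" (a uniform random delay up to an exponentially growing cap), and the base and cap values are arbitrary.

```python
import random
import time
from typing import Callable, Optional, Sequence


class ProviderError(Exception):
    def __init__(self, status: int) -> None:
        super().__init__(f"provider returned {status}")
        self.status = status


def is_retryable(status: int) -> bool:
    return status == 429 or 500 <= status < 600


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 8.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Full jitter: uniform delay in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * 2 ** attempt)


def call_with_failover(
    providers: Sequence[Callable[[str], str]],
    prompt: str,
    max_attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try each provider in order, retrying 429/5xx with jittered backoff."""
    if not providers:
        raise ValueError("at least one provider is required")
    last_error: Optional[ProviderError] = None
    for call in providers:
        for attempt in range(max_attempts):
            try:
                return call(prompt)
            except ProviderError as err:
                if not is_retryable(err.status):
                    raise                      # e.g. 400: retrying will not help
                last_error = err
                if attempt < max_attempts - 1:
                    sleep(backoff_delay(attempt))
        # Retries exhausted for this provider: fall through to the next one.
    assert last_error is not None
    raise last_error
```

Jitter spreads retries out so that many clients hitting the same 429 do not all retry at the same instant.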

What interviewers expect by level

Junior: Describe request → LLM → response, storing messages in a DB, and why we stream tokens.
Mid: Design SSE streaming end-to-end, context-window trimming, Redis context cache, per-user rate limits.
Senior: Add moderation on input/output, model routing for cost, resumable streams, provider failover, token budgets.
Staff: Multi-provider abstraction, capacity planning for token throughput, prompt-caching strategy, safety/compliance + audit logging.
