🤖 Design AI Chatbot — System Design Interview Guide
Medium · AI & ML Systems
Design a ChatGPT-style conversational AI service: users send messages, the system streams back model-generated responses while preserving multi-turn conversation context.
Open the interactive AI Chatbot design on PrepGrind → Drag load balancers, caches, databases, and queues onto a canvas, run a live traffic simulation to watch latency and bottlenecks under load, and follow the full interview walkthrough below — free, in your browser.
Functional requirements
- Multi-turn chat: responses use prior conversation history
- Stream tokens back to the client as they are generated
- Persist conversations so users can resume them later
- Per-user rate limiting and abuse / content moderation
- Support multiple models (fast/cheap vs smart/expensive)
Non-functional requirements & scale
- 500K daily active users, ~5 messages each (~2.5M msgs/day)
- Time-to-first-token (TTFT) p95 < 800ms
- Streaming must survive flaky mobile networks (resumable)
- Conversations retained for 1 year; encrypted at rest
- Graceful degradation when the model provider is rate-limited
Capacity estimation
At ~30 msgs/sec average (5–10× peak), each completion ~500 output tokens. Cost is dominated by tokens, so context trimming matters. LLM calls are slow (seconds) and capacity-constrained — the architecture is about streaming, queueing, and context management, not raw QPS.
Core entities
- Conversation — conversationId (PK), userId, title, model, createdAt, updatedAt
- Message — messageId, conversationId, role (user/assistant/system), content, tokenCount, createdAt
- User — userId, plan (free/pro), rateLimitRemaining, monthlyTokenBudget
API design
POST /api/v1/conversations— Create a conversation. Returns { conversationId }.POST /api/v1/conversations/:id/messages— Send a message. Returns an SSE stream of assistant tokens.GET /api/v1/conversations/:id— Fetch full message history (paginated).GET /api/v1/conversations— List the user's conversations.
High-level design
Client opens an SSE/WebSocket connection through the gateway to the Chat Service. The Chat Service loads recent history from the message store, assembles the prompt within the context budget, calls the LLM with streaming enabled, relays tokens to the client, and asynchronously persists the completed assistant message. Redis caches hot conversation context; a moderation check gates both input and output.
Deep dives
⚡ Token Streaming
Generation is autoregressive — tokens emerge one at a time. Hold an SSE (or WebSocket) connection from client → gateway → chat service and forward each token as it arrives from the model. Perceived latency becomes TTFT, not full-completion time. Critical: gateways/CDNs must NOT buffer the response (disable proxy buffering), and clients should send a "resume from token N" cursor so a dropped mobile connection can reconnect mid-answer.
🧠 Context Window Management
The model only sees what fits in its window (e.g. 128K tokens). For long chats: keep a rolling summary of old turns + the last N verbatim messages + the system prompt. Budget = context_window − max_output − system_prompt. Cache the assembled context per conversation in Redis to avoid re-reading the full history on every turn.
🛡️ Moderation & Safety
Screen BOTH input (block prompt-injection / disallowed requests) and output (filter unsafe generations) with a fast classifier before tokens reach the user. Since output is streamed, moderate in a sliding window and be ready to halt the stream. Log flagged events for review.
💸 Cost & Model Routing
Tokens are the bill. Route simple/short queries to a small fast model and only escalate to a frontier model when needed (length, complexity, or explicit user choice). Enforce per-user monthly token budgets. Cache identical system-prompt prefixes (prompt caching) to cut input cost.
Scaling considerations
- Chat service is I/O-bound (waiting on the model) — use async workers, high connection concurrency
- Queue requests when the model provider is saturated; apply backpressure + retries with jitter
- Cassandra message store partitioned by conversationId for fast history range reads
- Sticky routing so a streaming connection stays on one chat-service instance
- Fallback to a secondary model/provider on 429/5xx from the primary
What interviewers expect by level
- Junior: Describe request → LLM → response, storing messages in a DB, and why we stream tokens.
- Mid: Design SSE streaming end-to-end, context-window trimming, Redis context cache, per-user rate limits.
- Senior: Add moderation on input/output, model routing for cost, resumable streams, provider failover, token budgets.
- Staff: Multi-provider abstraction, capacity planning for token throughput, prompt-caching strategy, safety/compliance + audit logging.
Practice more system design case studies
- Design URL Shortener
- Design Social Media Feed
- Design Chat System
- Design Video Streaming
- Design Ride-Sharing Platform
- Design E-Commerce Platform
- Design UPI Payment Gateway
- Design Google Docs
- Design Tinder
- Design Google Drive / Dropbox
- Design Instagram
- Design Type-Ahead Search
- Design Web Crawler
- Design Ticket Booking (BookMyShow)
- Design Pastebin
- Design Notification System
- Design Rate Limiter (Standalone)
- Design Simple Web App
- Design Food Delivery (Swiggy)
- Design Stock Trading System
- Design Live Streaming (Twitch)
- Design Distributed Key-Value Store
- Design Ad Click Aggregation
- Design Monitoring / Metrics (Datadog)
- Design Online Judge (LeetCode)
- Design FB Post Search
- Design Yelp
- Design Cache Layer
- Design Message Queue
- Design Full Production Stack
- Design Semantic Search
- Design RAG System
- Design LLM Serving Platform
- Design Recommendation System
PrepGrind runs entirely in your browser, free, no installation required. Loading the interactive playground…