🎛️ Design LLM Serving Platform — System Design Interview Guide
Hard · AI & ML Systems
Design a platform that serves self-hosted LLMs at scale on a GPU fleet — maximizing throughput (tokens/sec) and GPU utilization while keeping tail latency low, like a vLLM/Triton-based inference service.
Functional requirements
- Serve multiple models behind a single OpenAI-compatible API
- Stream tokens with low time-to-first-token
- Continuous batching to maximize GPU utilization
- Autoscale GPU workers on queue depth / load
- Fair scheduling + priority tiers across tenants
Non-functional requirements & scale
- Fleet of hundreds of GPUs (A100/H100), multiple model families
- TTFT p95 < 500ms; sustained > 10K tokens/sec per large model
- GPU utilization > 70% (idle GPUs are the dominant cost)
- Graceful overload: shed/queue rather than crash under spikes
- Zero-downtime model rollouts and weight updates
Capacity estimation
GPUs are the scarce, expensive resource — the entire design optimizes their utilization. Requests have wildly variable output lengths, so static batching wastes the GPU. The core techniques are continuous (in-flight) batching, KV-cache management (PagedAttention), quantization, and a scheduler that balances throughput against per-request tail latency.
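A back-of-the-envelope estimate makes the KV-cache pressure concrete. The sketch below assumes a hypothetical 7B-parameter model (32 layers, 32 KV heads, head dimension 128, fp16) on an 80 GiB GPU. The shape, the 10% memory reserve, and the 2,048-token average context are illustrative assumptions, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical model shape; all values are illustrative assumptions."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    params_billion: float
    bytes_per_value: int = 2  # fp16/bf16


def kv_bytes_per_token(m: ModelShape) -> int:
    # Factor 2 = one key tensor plus one value tensor per layer.
    return 2 * m.num_layers * m.num_kv_heads * m.head_dim * m.bytes_per_value


def weight_bytes(m: ModelShape) -> int:
    return int(m.params_billion * 1e9 * m.bytes_per_value)


def max_concurrent_sequences(m: ModelShape, gpu_mem_gib: float,
                             avg_context_tokens: int,
                             reserve_fraction: float = 0.1) -> int:
    """Sequences that fit once weights and a safety reserve are subtracted."""
    total = gpu_mem_gib * 1024 ** 3
    usable = total * (1 - reserve_fraction) - weight_bytes(m)
    if usable <= 0:
        return 0
    return int(usable // (kv_bytes_per_token(m) * avg_context_tokens))


model_7b = ModelShape(num_layers=32, num_kv_heads=32, head_dim=128,
                      params_billion=7)
per_token = kv_bytes_per_token(model_7b)            # 524288 bytes = 0.5 MiB
concurrency = max_concurrent_sequences(model_7b, gpu_mem_gib=80,
                                       avg_context_tokens=2048)
```

Under these assumptions each token costs 0.5 MiB of KV cache, a 2,048-token sequence costs 1 GiB, and about 58 sequences fit next to the weights. That is why concurrency is budgeted in KV memory rather than FLOPs.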
Core entities
- InferenceRequest — requestId, model, prompt, maxTokens, priority, tenantId, stream
- ModelReplica — replicaId, model, gpuType, kvCacheFree, activeSeqs, state
- BatchSlot — sequenceId, requestId, generatedTokens, kvBlocks[]
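The entities above could be modeled roughly as follows. Field types, the `ReplicaState` values, and the `can_admit` helper are assumptions added for illustration; the guide only names the fields.

```python
from dataclasses import dataclass, field
from enum import Enum


class ReplicaState(Enum):
    # Hypothetical lifecycle states; the guide only lists a "state" field.
    LOADING = "loading"
    READY = "ready"
    DRAINING = "draining"


@dataclass
class InferenceRequest:
    request_id: str
    model: str
    prompt: str
    max_tokens: int
    tenant_id: str
    priority: int = 0
    stream: bool = True


@dataclass
class ModelReplica:
    replica_id: str
    model: str
    gpu_type: str
    kv_cache_free: int  # assumed unit: free KV-cache blocks
    active_seqs: int = 0
    state: ReplicaState = ReplicaState.LOADING

    def can_admit(self, blocks_needed: int) -> bool:
        """A replica takes new work only when ready and with enough KV room."""
        return (self.state is ReplicaState.READY
                and self.kv_cache_free >= blocks_needed)


@dataclass
class BatchSlot:
    sequence_id: int
    request_id: str
    generated_tokens: int = 0
    kv_blocks: list[int] = field(default_factory=list)
```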
API design
- POST /v1/chat/completions: OpenAI-compatible. Body: { model, messages, stream }. SSE token stream.
- POST /v1/embeddings: Batch embedding inference for a list of inputs.
- GET /v1/models: List served models and their readiness/capacity.
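For the streaming endpoint, a minimal sketch of how tokens might be framed as server-sent events. It assumes the chunk shape used by the OpenAI streaming API (`chat.completion.chunk` objects followed by a `data: [DONE]` sentinel); the function names are hypothetical.

```python
import json
from typing import Iterable, Iterator


def sse_event(payload: dict) -> str:
    """Frame one JSON payload as a server-sent event."""
    return f"data: {json.dumps(payload)}\n\n"


def stream_completion(request_id: str, model: str,
                      tokens: Iterable[str]) -> Iterator[str]:
    """Yield one SSE event per token, a final stop chunk, then the sentinel."""
    for tok in tokens:
        yield sse_event({
            "id": request_id,
            "object": "chat.completion.chunk",
            "model": model,
            "choices": [{"index": 0, "delta": {"content": tok},
                         "finish_reason": None}],
        })
    yield sse_event({
        "id": request_id,
        "object": "chat.completion.chunk",
        "model": model,
        "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}],
    })
    yield "data: [DONE]\n\n"


events = list(stream_completion("req-1", "demo-model", ["Hel", "lo"]))
```

Emitting each token as soon as it is decoded is what keeps time-to-first-token low: the client sees output after the prefill plus one decode step instead of waiting for the full completion.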
High-level design
A router authenticates, applies rate limits, and places each request on the queue for the target model. A scheduler pulls requests into a continuously-batched run loop on GPU workers, managing the KV cache per sequence. Tokens stream back through the router to the client. A control plane handles model loading, autoscaling on queue depth, and zero-downtime weight rollouts.
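The router's rate-limit and enqueue step can be sketched as a per-tenant token bucket in front of per-model queues. This is a single-process toy with hypothetical class names and status strings. A real router would also authenticate the caller and keep its state in shared storage.

```python
from collections import defaultdict, deque


class TokenBucket:
    """Allows `burst` requests at once, refilled at `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill for the time elapsed since the previous call, capped at burst.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class Router:
    def __init__(self, served_models, rate_per_sec: float = 1.0, burst: int = 2):
        self.queues = {m: deque() for m in served_models}
        self.buckets = defaultdict(lambda: TokenBucket(rate_per_sec, burst))

    def submit(self, tenant_id: str, model: str, request_id: str,
               now: float) -> str:
        if model not in self.queues:
            return "404 unknown model"
        if not self.buckets[tenant_id].allow(now):
            return "429 rate limited"
        self.queues[model].append(request_id)
        return "202 queued"


router = Router(["demo-model"])
results = [router.submit("tenant-a", "demo-model", f"req-{i}", now=0.0)
           for i in range(3)]
```

The third request at `now=0.0` is rejected because the burst of two is spent. One second later the bucket has refilled enough for one more request.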
Deep dives
🔁 Continuous (In-Flight) Batching
Static batching waits to fill a batch and is held hostage by the longest sequence. Continuous batching mutates the running batch every decode step: finished sequences free their slot immediately and waiting requests join without draining the batch. This typically raises GPU utilization from around 30% to 70% or more and improves throughput by roughly 2–20×, depending on how varied the output lengths are. At each step the scheduler decides which sequences run, subject to available KV-cache memory.
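A toy simulation shows the difference. It counts decode steps only, ignoring prefill cost and KV-cache limits, and the output lengths are made up for illustration.

```python
from collections import deque


def static_batch_steps(output_lens, batch_size):
    """Each fixed batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batch_steps(output_lens, batch_size):
    """Free slots are refilled from the queue before every decode step."""
    waiting = deque(output_lens)
    running = []  # tokens still to generate for each active sequence
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step: every active sequence emits one token,
        # and sequences with nothing left to generate give up their slot.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


lens = [100, 5, 5, 5, 100, 5, 5, 5]
static_steps = static_batch_steps(lens, batch_size=4)          # 200
continuous_steps = continuous_batch_steps(lens, batch_size=4)  # 105
```

With 230 tokens generated in total, slot utilization is 230 / (200 × 4) ≈ 29% for static batching and 230 / (105 × 4) ≈ 55% for continuous batching on this workload.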
🧮 KV-Cache & PagedAttention
Each active sequence stores key/value tensors for every token at every layer, which can add up to gigabytes for long contexts, and naive contiguous allocation fragments VRAM badly. PagedAttention (vLLM) pages the KV cache into fixed-size blocks, like virtual memory. This nearly eliminates fragmentation, enables higher concurrency, and allows prefix sharing (identical system prompts reuse KV blocks). KV memory, not FLOPs, is usually the concurrency limit.
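The block-table idea can be sketched with a small allocator that hands out fixed-size blocks and reference-counts full blocks whose entire prefix matches. This is a toy illustration of the concept, not vLLM's implementation. The class and method names are hypothetical, and a real system would hash prefixes instead of storing token tuples.

```python
class BlockAllocator:
    """Toy paged KV allocator with ref-counted sharing of full prefix blocks."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.refcount = {}       # block id -> number of sequences using it
        self.prefix_index = {}   # tuple of all tokens up to block end -> block id

    def allocate(self, tokens):
        """Return the block table (list of block ids) for a prompt."""
        table = []
        for start in range(0, len(tokens), self.block_size):
            chunk = tokens[start:start + self.block_size]
            key = tuple(tokens[:start + len(chunk)])
            is_full = len(chunk) == self.block_size
            if is_full and key in self.prefix_index:
                block = self.prefix_index[key]   # reuse the shared block
                self.refcount[block] += 1
            else:
                if not self.free:
                    self.release(table)          # roll back partial allocation
                    raise MemoryError("out of KV-cache blocks")
                block = self.free.pop()
                self.refcount[block] = 1
                if is_full:                      # partial blocks stay private
                    self.prefix_index[key] = block
            table.append(block)
        return table

    def release(self, table):
        """Drop one reference per block; return unreferenced blocks to the pool."""
        for block in table:
            self.refcount[block] -= 1
            if self.refcount[block] == 0:
                del self.refcount[block]
                self.prefix_index = {k: v for k, v in self.prefix_index.items()
                                     if v != block}
                self.free.append(block)


alloc = BlockAllocator(num_blocks=8, block_size=4)
system_prompt = list(range(8))                 # fills exactly two blocks
seq_a = alloc.allocate(system_prompt + [100, 101])
seq_b = alloc.allocate(system_prompt + [200])
```

Both sequences point at the same two blocks for the shared system prompt and each gets one private block for its own suffix, so four blocks are used instead of six.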
📉 Quantization & Parallelism
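Quantizing weights from fp16 to int8 or int4 (for example with GPTQ or AWQ) roughly halves or quarters weight VRAM, letting bigger models or larger batches fit per GPU, with a small quality hit. Models too big for one GPU use tensor parallelism (shard each layer's weight matrices across GPUs, which requires frequent inter-GPU communication) or pipeline parallelism (assign contiguous groups of layers to different GPUs as stages). Choose based on model size and interconnect bandwidth (NVLink vs PCIe).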
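The memory arithmetic behind that choice can be sketched as follows. The 70B parameter count, the 80 GiB GPU, and the 30% headroom reserved for KV cache and activations are illustrative assumptions. The estimate also ignores the small overhead of quantization scales and zero points.

```python
import math

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gib(params_billion: float, precision: str) -> float:
    """Approximate weight footprint in GiB at the given precision."""
    return params_billion * 1e9 * BYTES_PER_PARAM[precision] / 1024 ** 3


def min_gpus_for_weights(params_billion: float, precision: str,
                         gpu_mem_gib: float,
                         kv_headroom_fraction: float = 0.3) -> int:
    """Fewest GPUs whose non-reserved memory can hold the weights."""
    usable_per_gpu = gpu_mem_gib * (1 - kv_headroom_fraction)
    return math.ceil(weight_gib(params_billion, precision) / usable_per_gpu)


gpus_needed = {p: min_gpus_for_weights(70, p, gpu_mem_gib=80)
               for p in ("fp16", "int8", "int4")}
```

Under these assumptions a 70B model needs three GPUs at fp16, two at int8, and one at int4. In practice tensor-parallel degrees are usually powers of two, so the fp16 case would likely be deployed across four.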
📊 Scheduling, Autoscaling & Overload
Scale GPU workers on queue depth and TTFT, not CPU utilization. GPU workers are slow to start because each must load tens to hundreds of GB of weights, so keep warm pools and pre-load popular models. Under overload, shed or queue with admission control and priority tiers rather than letting latency explode. Schedule prefill (compute-bound) separately from decode (memory-bandwidth-bound), since the two phases stress different resources.
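A minimal sketch of the two control decisions described here: how many replicas to run for a given queue depth, and whether to admit or shed a new request. The thresholds, step limits, and the two-tier priority scheme are assumptions chosen for illustration.

```python
import math


def desired_replicas(queue_depth: int, per_replica_capacity: int, current: int,
                     min_replicas: int = 1, max_replicas: int = 64,
                     max_step_up: int = 4) -> int:
    """Target replica count from queue depth, scaling up fast and down slowly."""
    target = math.ceil(queue_depth / per_replica_capacity)
    target = max(min_replicas, min(max_replicas, target))
    if target > current:
        # Cap the jump so one spike does not drain the whole warm pool.
        return min(target, current + max_step_up)
    # Remove at most one replica per tick, since reloading weights is expensive.
    return max(target, current - 1)


def admit(queue_depth: int, priority: str,
          soft_limit: int, hard_limit: int) -> str:
    """Shed low-priority work first, and everything past the hard limit."""
    if queue_depth >= hard_limit:
        return "shed"
    if queue_depth >= soft_limit and priority == "low":
        return "shed"
    return "queue"


scale_up = desired_replicas(queue_depth=100, per_replica_capacity=10, current=4)
scale_down = desired_replicas(queue_depth=5, per_replica_capacity=10, current=4)
```

The asymmetry is deliberate: scaling down one replica at a time avoids discarding loaded weights during a brief lull, while the soft limit protects high-priority tenants before the queue is completely full.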
Scaling considerations
- Bin-pack models onto GPUs; co-locate small models, dedicate GPUs to large ones
- Warm pools + fast weight loading to hide cold-start (weights are huge)
- Tensor/pipeline parallelism for models that exceed a single GPU's VRAM
- Prefix/prompt-cache sharing across requests with identical system prompts
- Per-tenant quotas + priority queues to protect latency under contention
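The bin-packing point can be illustrated with first-fit decreasing, a common heuristic that places the largest models first. The model names and memory budgets (weights plus a KV-cache allowance, in GiB) are made up, and a production placer would also account for traffic and isolation.

```python
def first_fit_decreasing(model_mem_gib: dict[str, float],
                         gpu_mem_gib: float) -> list[list[str]]:
    """Pack models onto as few GPUs as the heuristic finds, largest first."""
    gpus = []  # each entry: {"free": remaining GiB, "models": [names]}
    by_size = sorted(model_mem_gib.items(), key=lambda kv: kv[1], reverse=True)
    for name, need in by_size:
        if need > gpu_mem_gib:
            raise ValueError(f"{name} needs tensor/pipeline parallelism")
        for gpu in gpus:
            if gpu["free"] >= need:
                gpu["free"] -= need
                gpu["models"].append(name)
                break
        else:
            # No existing GPU has room, so open a new one.
            gpus.append({"free": gpu_mem_gib - need, "models": [name]})
    return [gpu["models"] for gpu in gpus]


placement = first_fit_decreasing(
    {"large": 78, "medium": 40, "small-a": 20, "small-b": 15, "embed": 5},
    gpu_mem_gib=80,
)
```

Here the large model ends up alone on one GPU and the four smaller models share the second, matching the co-locate-small, dedicate-large guidance above.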
What interviewers expect by level
- Junior: Explain why GPUs are the bottleneck and what batching/streaming buy us.
- Mid: Design the router→queue→GPU-worker path with streaming and basic batching + autoscaling.
- Senior: Continuous batching, KV-cache/PagedAttention, quantization, queue-depth autoscaling, overload shedding.
- Staff: Multi-model bin-packing, tensor/pipeline parallelism, prefill/decode disaggregation, capacity + cost modeling, zero-downtime rollouts.