🎛️ Design LLM Serving Platform — System Design Interview Guide

Hard · AI & ML Systems

Design a platform that serves self-hosted LLMs at scale on a GPU fleet — maximizing throughput (tokens/sec) and GPU utilization while keeping tail latency low, like a vLLM/Triton-based inference service.

Open the interactive LLM Serving Platform design on PrepGrind → Drag load balancers, caches, databases, and queues onto a canvas, run a live traffic simulation to watch latency and bottlenecks under load, and follow the full interview walkthrough below — free, in your browser.

Functional requirements

Serve multiple models behind a single OpenAI-compatible API
Stream tokens with low time-to-first-token
Continuous batching to maximize GPU utilization
Autoscale GPU workers on queue depth / load
Fair scheduling + priority tiers across tenants

Non-functional requirements & scale

Fleet of hundreds of GPUs (A100/H100), multiple model families
TTFT p95 < 500ms; sustained > 10K tokens/sec per large model
GPU utilization > 70% (idle GPUs are the dominant cost)
Graceful overload: shed/queue rather than crash under spikes
Zero-downtime model rollouts and weight updates

Capacity estimation

GPUs are the scarce, expensive resource — the entire design optimizes their utilization. Requests have wildly variable output lengths, so static batching wastes the GPU. The core techniques are continuous (in-flight) batching, KV-cache management (PagedAttention), quantization, and a scheduler that balances throughput against per-request tail latency.

Core entities

InferenceRequest — requestId, model, prompt, maxTokens, priority, tenantId, stream
ModelReplica — replicaId, model, gpuType, kvCacheFree, activeSeqs, state
BatchSlot — sequenceId, requestId, generatedTokens, kvBlocks[]

API design

POST /v1/chat/completions — OpenAI-compatible. Body: { model, messages, stream }. SSE token stream.
POST /v1/embeddings — Batch embedding inference for a list of inputs.
GET /v1/models — List served models and their readiness/capacity.

High-level design

A router authenticates, applies rate limits, and places each request on the queue for the target model. A scheduler pulls requests into a continuously-batched run loop on GPU workers, managing the KV cache per sequence. Tokens stream back through the router to the client. A control plane handles model loading, autoscaling on queue depth, and zero-downtime weight rollouts.

Deep dives

🔁 Continuous (In-Flight) Batching

Static batching waits to fill a batch and is held hostage by the longest sequence. Continuous batching mutates the running batch every decode step: finished sequences free their slot immediately and waiting requests join without draining — pushing GPU utilization from ~30% to 70%+ and 2–20× throughput. The scheduler decides, each step, which sequences run given KV-cache memory.

🧮 KV-Cache & PagedAttention

Each active sequence stores per-token key/value tensors — often gigabytes — and naive contiguous allocation fragments VRAM badly. PagedAttention (vLLM) pages the KV cache into fixed blocks like virtual memory, eliminating fragmentation, enabling higher concurrency, and allowing prefix sharing (identical system prompts reuse KV blocks). KV memory, not FLOPs, is usually the concurrency limit.

📉 Quantization & Parallelism

int8/int4 quantization (GPTQ/AWQ) halves or quarters VRAM, letting bigger models or larger batches fit per GPU, with a small quality hit. Models too big for one GPU use tensor parallelism (split layers across GPUs, high comms) or pipeline parallelism (split by stage). Choose based on model size vs interconnect (NVLink vs PCIe).

📊 Scheduling, Autoscaling & Overload

Scale GPU workers on queue depth and TTFT, not CPU. GPUs are slow to spin up (load 10s–100s of GB of weights), so keep warm pools and pre-load popular models. Under overload, shed or queue with admission control and priority tiers rather than letting latency explode. Separate prefill (compute-bound) from decode (memory-bound) for better scheduling.

Scaling considerations

Bin-pack models onto GPUs; co-locate small models, dedicate GPUs to large ones
Warm pools + fast weight loading to hide cold-start (weights are huge)
Tensor/pipeline parallelism for models that exceed a single GPU's VRAM
Prefix/prompt-cache sharing across requests with identical system prompts
Per-tenant quotas + priority queues to protect latency under contention

What interviewers expect by level

Junior: Explain why GPUs are the bottleneck and what batching/streaming buy us.
Mid: Design the router→queue→GPU-worker path with streaming and basic batching + autoscaling.
Senior: Continuous batching, KV-cache/PagedAttention, quantization, queue-depth autoscaling, overload shedding.
Staff: Multi-model bin-packing, tensor/pipeline parallelism, prefill/decode disaggregation, capacity + cost modeling, zero-downtime rollouts.

Practice more system design case studies

PrepGrind runs entirely in your browser, free, no installation required. Loading the interactive playground…