🎛️ Design LLM Serving Platform — System Design Interview Guide

Hard · AI & ML Systems

Design a platform that serves self-hosted LLMs at scale on a GPU fleet — maximizing throughput (tokens/sec) and GPU utilization while keeping tail latency low, like a vLLM/Triton-based inference service.


Functional requirements

Non-functional requirements & scale

Capacity estimation

GPUs are the scarce, expensive resource — the entire design optimizes their utilization. Requests have wildly variable output lengths, so static batching wastes the GPU. The core techniques are continuous (in-flight) batching, KV-cache management (PagedAttention), quantization, and a scheduler that balances throughput against per-request tail latency.

Core entities

API design

High-level design

A router authenticates, applies rate limits, and places each request on the queue for the target model. A scheduler pulls requests into a continuously-batched run loop on GPU workers, managing the KV cache per sequence. Tokens stream back through the router to the client. A control plane handles model loading, autoscaling on queue depth, and zero-downtime weight rollouts.
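The router's three steps (authenticate, rate-limit, enqueue per model) can be sketched as follows. This is a minimal illustration, not a production design: `TokenBucket`, `Router`, the status strings, and the default rate and burst values are all hypothetical names and numbers chosen for the example, and time is passed in explicitly so the behavior is deterministic.

```python
from collections import defaultdict, deque


class TokenBucket:
    """Refills `rate` tokens per second, up to `capacity` (assumed policy)."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = now

    def allow(self, now):
        # Credit tokens for the time elapsed since the last call.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class Router:
    """Hypothetical router: auth check, rate limit, then per-model queue."""

    def __init__(self, api_keys, rate=1.0, burst=2.0):
        self.api_keys = set(api_keys)
        self.rate = rate
        self.burst = burst
        self.buckets = {}                  # api_key -> TokenBucket
        self.queues = defaultdict(deque)   # model name -> waiting requests

    def submit(self, api_key, model, prompt, now):
        if api_key not in self.api_keys:
            return "401 unauthorized"
        bucket = self.buckets.setdefault(
            api_key, TokenBucket(self.rate, self.burst, now)
        )
        if not bucket.allow(now):
            return "429 rate limited"
        # The scheduler for `model` would pull from this queue.
        self.queues[model].append((api_key, prompt))
        return "202 queued"
```

Keeping one queue per model lets the scheduler and the autoscaler observe depth per model rather than across the whole fleet.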

Deep dives

🔁 Continuous (In-Flight) Batching

Static batching waits to fill a batch, then holds every slot until the longest sequence finishes. Continuous batching updates the running batch at every decode step: finished sequences free their slots immediately, and waiting requests join without draining the batch. This can raise GPU utilization from roughly 30% to 70% or more and improve throughput by 2–20×, depending on the workload. At each step, the scheduler decides which sequences run, subject to available KV-cache memory.
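A toy simulation makes the difference concrete. The sketch below counts decode steps only; it assumes one token per running sequence per step, ignores prefill and KV-cache limits, and uses hypothetical function names.

```python
from collections import deque


def static_batching_steps(output_lens, batch_size):
    """Each batch occupies the GPU until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens, batch_size):
    """Refill free slots before every decode step; finished sequences leave at once."""
    waiting = deque(output_lens)
    running = []  # tokens still to generate for each active sequence
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step: every running sequence emits one token.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


# Toy workload: one long request mixed with three short ones, repeated 4 times.
workload = [100, 5, 5, 5] * 4
static_steps = static_batching_steps(workload, batch_size=4)
continuous_steps = continuous_batching_steps(workload, batch_size=4)
```

On this workload, static batching needs 400 decode steps because every batch contains one 100-token request, while the continuous version finishes in 130 because short requests stop occupying slots as soon as they complete.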

🧮 KV-Cache & PagedAttention

Each active sequence stores key/value tensors for every token it has processed, which can total gigabytes for long contexts, and naive contiguous allocation fragments VRAM badly. PagedAttention (vLLM) splits the KV cache into fixed-size blocks mapped through a per-sequence block table, much like virtual memory pages. This nearly eliminates fragmentation, allows higher concurrency, and enables prefix sharing (requests with identical system prompts reuse the same KV blocks). KV memory, not FLOPs, is usually the limit on concurrency.
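The block-table idea can be illustrated with a small reference-counted allocator. This is a simplified sketch and not vLLM's implementation: `BlockAllocator` and its methods are hypothetical, only full blocks are shared (so no copy-on-write is needed), and the model dimensions passed to `kv_bytes_per_token` are assumed values for illustration.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem):
    """Keys and values (factor of 2) for every layer and KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


class BlockAllocator:
    """Toy paged KV-cache allocator with reference-counted blocks."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.refcount = {}   # physical block id -> number of sequences using it
        self.tables = {}     # seq_id -> list of physical block ids
        self.lengths = {}    # seq_id -> tokens stored

    def append_token(self, seq_id):
        """Reserve room for one more token; allocate a block only when needed."""
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:   # last block is full, or none exists yet
            if not self.free:
                return False           # caller must queue or preempt
            block = self.free.pop()
            self.refcount[block] = 1
            table.append(block)
        self.lengths[seq_id] = n + 1
        return True

    def fork_prefix(self, parent, child):
        """Let `child` reuse the parent's full blocks (for example a shared system prompt)."""
        full = self.lengths.get(parent, 0) // self.block_size
        shared = self.tables.get(parent, [])[:full]
        for block in shared:
            self.refcount[block] += 1
        self.tables[child] = list(shared)
        self.lengths[child] = full * self.block_size

    def release(self, seq_id):
        """Drop a sequence; a block returns to the free list when unreferenced."""
        for block in self.tables.pop(seq_id, []):
            self.refcount[block] -= 1
            if self.refcount[block] == 0:
                del self.refcount[block]
                self.free.append(block)
        self.lengths.pop(seq_id, None)


# Assumed model shape: 32 layers, 32 KV heads, head_dim 128, fp16 (2 bytes).
per_token = kv_bytes_per_token(32, 32, 128, 2)
```

With those assumed dimensions each token costs 524,288 bytes (0.5 MiB) of KV cache, so a 4,096-token sequence needs about 2 GiB. That is why block-level sharing and tight packing translate directly into more concurrent sequences.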

📉 Quantization & Parallelism

int8/int4 weight quantization (for example GPTQ or AWQ) cuts weight memory to roughly a half or a quarter of fp16, letting bigger models or larger batches fit per GPU at a small quality cost. Models too big for one GPU use tensor parallelism (shard each layer's weight matrices across GPUs, which is communication-heavy) or pipeline parallelism (assign consecutive groups of layers to different GPUs as stages). Choose based on model size and interconnect bandwidth (NVLink vs PCIe).
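A back-of-the-envelope calculation shows how precision drives GPU count. The sketch below is an estimate under stated assumptions: it counts weights only, reserves a fixed 30% of each GPU for KV cache and activations (an assumed figure), and rounds the GPU count up to a power of two, a common but not universal constraint for tensor parallelism. The function names are hypothetical.

```python
import math


def weight_gib(params_billion, bits):
    """Approximate weight memory in GiB at a given precision."""
    return params_billion * 1e9 * bits / 8 / 2**30


def tensor_parallel_degree(params_billion, bits, gpu_gib, kv_reserve_frac=0.3):
    """Smallest power-of-two GPU count whose usable memory holds the weights."""
    usable = gpu_gib * (1 - kv_reserve_frac)   # leave room for KV cache
    needed = math.ceil(weight_gib(params_billion, bits) / usable)
    degree = 1
    while degree < needed:
        degree *= 2
    return degree


# A 70B-parameter model on 80 GiB GPUs at three precisions.
plan = {bits: tensor_parallel_degree(70, bits, gpu_gib=80) for bits in (16, 8, 4)}
```

Under these assumptions the 70B model needs 4 GPUs at fp16 (about 130 GiB of weights), 2 at int8, and 1 at int4, which is the trade the paragraph above describes: fewer GPUs and less interconnect traffic in exchange for some quality loss.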

📊 Scheduling, Autoscaling & Overload

Scale GPU workers on queue depth and time to first token (TTFT), not CPU utilization. GPU workers are slow to start because they must load tens to hundreds of GB of weights, so keep warm pools and pre-load popular models. Under overload, apply admission control with priority tiers, shedding or queueing requests rather than letting latency explode. Separate prefill (compute-bound) from decode (memory-bandwidth-bound) so each phase can be scheduled and scaled on its own.
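Admission control and queue-driven scaling can be sketched together. Everything here is illustrative: `AdmissionQueue`, `desired_workers`, the tier numbering (lower number means higher priority), and the thresholds are assumptions, not values from any particular system.

```python
import heapq
import itertools
import math


class AdmissionQueue:
    """Bounded priority queue: lower tier number means higher priority."""

    def __init__(self, max_depth):
        self.max_depth = max_depth
        self._heap = []
        self._seq = itertools.count()   # arrival order breaks ties within a tier

    def offer(self, request_id, tier):
        """Admit the request. Returns the id that was shed, or None."""
        entry = (tier, next(self._seq), request_id)
        if len(self._heap) < self.max_depth:
            heapq.heappush(self._heap, entry)
            return None
        worst = max(self._heap)          # lowest priority, latest arrival
        if entry < worst:
            # The newcomer outranks the worst queued request: evict that one.
            self._heap.remove(worst)
            heapq.heapify(self._heap)
            heapq.heappush(self._heap, entry)
            return worst[2]
        return request_id                # shed the newcomer instead

    def pop(self):
        """Next request to schedule: highest priority, oldest first."""
        return heapq.heappop(self._heap)[2]


def desired_workers(queue_depth, ttft_p95_s, current,
                    per_worker_depth=8, ttft_slo_s=1.0, max_workers=64):
    """Scale on queue depth, and add a worker whenever TTFT breaches the SLO."""
    want = max(1, math.ceil(queue_depth / per_worker_depth))
    if ttft_p95_s > ttft_slo_s:
        want = max(want, current + 1)
    return min(want, max_workers)
```

Bounding the queue keeps waiting time predictable: once it is full, low-tier traffic is rejected quickly instead of sitting in line long enough to miss its latency target anyway. Because new workers take a long time to load weights, a real autoscaler would also act on trends rather than on a single reading.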

Scaling considerations

What interviewers expect by level

