🧠 K-Means Clustering — AI / ML Interview Guide
Neural Foundations · interactive visualization + interview prep
Open the interactive K-Means Clustering visualization on PrepGrind → Step through a live animation, tune the parameters, and read the full theory, math, reference code, and interview Q&A below — free, in your browser.
What it is
K-means groups points into k clusters by repeating two steps: assign each point to its nearest centroid, then move each centroid to the average of its points. Repeat and the centroids settle into the centers of the natural groups. It’s unsupervised — no labels, just geometry.
Mental model
Two moves taking turns. ASSIGN: every point joins the nearest flag. UPDATE: every flag walks to the center of its crowd. Each move can only LOWER the total spread-around-flags (the WCSS), so repeating them slides downhill until nobody moves. That is the whole algorithm — coordinate descent on within-cluster variance, alternating between "who belongs to whom" and "where should the centers be".
Theory
K-means is unsupervised: with no labels, it partitions n points into k groups to minimize the within-cluster sum of squares, WCSS = Σ_clusters Σ_points ‖x − μ‖². The objective rewards tight, compact clusters; the only inputs are the points' geometry and the chosen k.
Lloyd's algorithm minimizes this by ALTERNATING minimization (a form of coordinate descent). Fixing the centroids, the best assignment is "nearest centroid"; fixing the assignment, the best centroid is the cluster MEAN (the mean is exactly the point minimizing squared distance to a set). Each step optimizes one variable group with the other held constant, so WCSS is non-increasing.
Because there are finitely many possible assignments and each step never raises WCSS, the algorithm is guaranteed to converge — but only to a LOCAL optimum that depends on the initial centroids. Bad initialization can give poor clusters, so k-means++ seeds centroids spread far apart, and practitioners run several restarts and keep the lowest WCSS.
The squared-Euclidean objective bakes in assumptions: clusters are implicitly assumed roughly spherical, similar in size, and similar in density. It therefore struggles with elongated, nested, or varying-density shapes — where Gaussian Mixture Models, DBSCAN, or spectral clustering do better. Feature scaling matters too, since a large-range feature dominates the distance.
Choosing k is the open question, since more clusters always reduce WCSS (k = n gives zero). Heuristics: the elbow method (find the bend in WCSS-vs-k), the silhouette score, or the gap statistic — but domain knowledge usually decides.
Concrete example
Customer segmentation: plot customers by spend and frequency, run k-means with k=3, and you get "budget", "regular", and "VIP" groups — each centroid is the profile of a segment. Same idea powers image color quantization and document clustering.
Key equations
objective: minimize WCSS = Σ_clusters Σ_points ‖x − μ_cluster‖²assign: cluster(x) = argminₖ ‖x − μₖ‖²update: μₖ = mean of all points assigned to kalternate assign/update (Lloyd’s algorithm) — WCSS never increasesconverges to a LOCAL optimum (depends on initialization)
Step by step
- Initialize k centroids (here, random points).
- Assign step: color each point by its nearest centroid.
- Update step: move each centroid to the mean of its colored points.
- WCSS (the loss) drops each iteration — watch the curve.
- Press play; centroids stop moving at convergence. Reset re-initializes.
Interview questions & answers
How do you choose k?
No labels, so use heuristics: the elbow method (plot WCSS vs k, pick the bend), silhouette score, or gap statistic. Domain knowledge often dominates.
Why can k-means give different results across runs?
It converges to a LOCAL optimum that depends on the random initial centroids. k-means++ initialization spreads seeds out to reduce bad local minima; running multiple restarts and keeping the lowest WCSS is common.
When does k-means fail?
It assumes roughly spherical, equally-sized clusters (Euclidean). It struggles with elongated, nested, or varying-density shapes — try GMMs, DBSCAN, or spectral clustering.
Is k-means guaranteed to converge?
Yes, to a local optimum: each assign/update step never increases WCSS and there are finitely many assignments, so it terminates — but not necessarily the global optimum.
Common pitfalls
- Not scaling/normalizing features — a large-range feature dominates the distance.
- Assuming the result is THE clustering; it’s one local optimum of many.
- Using k-means on non-spherical or very different-density clusters.
Where it shows up
- Customer/user segmentation
- Image color quantization & compression
- Document / embedding clustering, vector-index coarse quantizers (IVF)
More AI / ML interview concepts
- Neural Networks & Backpropagation
- Gradient Descent & Optimizers
- Activation Functions
- Self-Attention
- Multi-Head Attention
- Softmax, Temperature & Sampling
- Tokenization (Byte-Pair Encoding)
- Positional Encoding
- KV Cache
- Rotary Position Embedding (RoPE)
- The Transformer Block
- Normalization (LayerNorm / RMSNorm)
- Multi-Query & Grouped-Query Attention
- Flash Attention
- Decoding: Beam Search & Speculative Decoding
- Embeddings & Cosine Similarity
- RAG (Retrieval-Augmented Generation) Pipeline
- Vector Search (HNSW)
- Chunking & Reranking
- ReAct Agent Loop
- Tool / Function Calling
- Multi-Agent Orchestration
- Planning & Task Decomposition
- Agent Memory
- Model Context Protocol (MCP)
- Quantization
- LoRA / PEFT Fine-Tuning
- Mixture of Experts (MoE)
- RLHF / DPO Alignment
- Evals & LLM-as-Judge
- Prompt Injection & Guardrails
- Knowledge Distillation
PrepGrind runs entirely in your browser, free, no installation required. Loading the interactive playground…