🧠 Gradient Descent & Optimizers — AI / ML Interview Guide
Neural Foundations · interactive visualization + interview prep
What it is
Training = minimizing a loss. Gradient descent does it by repeatedly stepping downhill: compute the gradient (the direction of steepest increase) and move the opposite way. Optimizers like Momentum and Adam change HOW you step to converge faster and avoid getting stuck or zig-zagging.
Mental model
You are blindfolded on a hilly landscape (the loss surface) and can only feel the slope under your feet (the gradient). To get to the bottom you repeatedly step in the steepest-downhill direction. The optimizer is your stepping STRATEGY: SGD follows only the current slope, scaled by a fixed learning rate, so it zig-zags in ravines; Momentum keeps the velocity it built up so it rolls through them; Adam remembers how steep each direction has been and rescales its step per-direction. Same goal, smarter feet.
Theory
Training is an optimization problem: find parameters θ minimizing a loss L(θ). For anything but trivial models there is no closed-form solution, so we iterate. The gradient ∇L(θ) points in the direction of STEEPEST INCREASE of the loss, so stepping the opposite way, θ ← θ − η∇L(θ), locally decreases it. η (the learning rate) sets how far we step.
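As a minimal sketch of the update rule θ ← θ − η∇L(θ), the loop below minimizes a toy one-parameter loss. The loss L(θ) = (θ − 3)², the starting point, the learning rate, and the function names are all illustrative assumptions, not part of the original guide.

```python
def gradient_descent(grad, theta0, lr=0.1, steps=50):
    """Repeat theta <- theta - lr * grad(theta); return every visited point."""
    theta = theta0
    path = [theta]
    for _ in range(steps):
        theta = theta - lr * grad(theta)  # step against the gradient
        path.append(theta)
    return path


# Toy loss (an assumption for illustration): L(theta) = (theta - 3)^2,
# whose gradient is 2 * (theta - 3) and whose minimum sits at theta = 3.
def loss(theta):
    return (theta - 3.0) ** 2


def grad_loss(theta):
    return 2.0 * (theta - 3.0)


path = gradient_descent(grad_loss, theta0=0.0, lr=0.1, steps=50)
losses = [loss(t) for t in path]
print(f"final theta: {path[-1]:.5f}, final loss: {losses[-1]:.2e}")
```

Each step multiplies the distance to the minimum by (1 − 2η) = 0.8 here, so the loss shrinks on every iteration.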
We use first-order (gradient-only) methods because they scale. Second-order methods such as Newton's method use the Hessian for a better step direction, but storing it costs O(n²) memory and inverting it costs O(n³) time, which is impractical for models with millions or billions of parameters. First-order steps are cheap (roughly one forward and backward pass each) and, repeated enough times, make steady progress toward a minimum.
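To make the O(n²) memory cost concrete, here is a back-of-the-envelope calculation. It assumes a dense Hessian stored at 4 bytes per entry (32-bit floats); the helper name is hypothetical.

```python
def hessian_bytes(n_params, bytes_per_value=4):
    """Memory for a dense n x n Hessian, assuming 32-bit floats by default."""
    return n_params * n_params * bytes_per_value


for n in (10**6, 10**9):
    terabytes = hessian_bytes(n) / 1e12
    print(f"{n:>13,} parameters -> {terabytes:,.0f} TB for the Hessian")
```

Under these assumptions a one-million-parameter model already needs about 4 TB just to hold the Hessian, while its gradient needs only about 4 MB.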
Plain gradient descent struggles on ill-conditioned surfaces — long narrow valleys where one direction is far steeper than another. It oscillates across the steep walls while crawling along the flat floor. Momentum fixes this by accumulating an exponentially-weighted velocity of past gradients: oscillations cancel out, consistent directions accelerate.
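The effect can be sketched on an ill-conditioned quadratic. The loss L(x, y) = 0.5·(x² + 100·y²), the starting point, the learning rate of 0.018, and β = 0.9 are assumptions chosen for illustration; setting β = 0 reduces the same update to plain gradient descent.

```python
CURVATURES = (1.0, 100.0)  # assumed: one flat direction, one steep direction


def loss(theta):
    return 0.5 * sum(c * t * t for c, t in zip(CURVATURES, theta))


def grad(theta):
    return [c * t for c, t in zip(CURVATURES, theta)]


def steps_to_converge(beta, lr=0.018, tol=1e-6, max_steps=5000):
    """Count updates until the loss drops below tol (beta = 0 is plain GD)."""
    theta = [10.0, 1.0]
    velocity = [0.0, 0.0]
    for step in range(1, max_steps + 1):
        g = grad(theta)
        # v <- beta * v - lr * grad ; theta <- theta + v
        velocity = [beta * v - lr * gi for v, gi in zip(velocity, g)]
        theta = [t + v for t, v in zip(theta, velocity)]
        if loss(theta) < tol:
            return step
    return max_steps


gd_steps = steps_to_converge(beta=0.0)
momentum_steps = steps_to_converge(beta=0.9)
print(f"plain GD: {gd_steps} steps, momentum: {momentum_steps} steps")
```

The learning rate is capped by the steep direction, so plain gradient descent crawls along the flat one; the accumulated velocity lets momentum cover that direction in noticeably fewer updates.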
Adam (Adaptive Moment Estimation) keeps two running averages per parameter: the 1st moment (an exponential average of gradients, like momentum) and the 2nd moment (an exponential average of squared gradients, a per-direction scale). It divides the 1st moment by √(2nd moment), so directions with large or noisy gradients take small steps and flat directions take large ones. Because both averages start at zero, a bias correction rescales them during the first steps. This robustness is why Adam/AdamW is the default for transformers.
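A minimal sketch of one Adam update on plain Python lists follows. The hyperparameter values (lr = 0.001, β₁ = 0.9, β₂ = 0.999, ε = 1e-8) are the commonly used defaults and are assumptions here, as are the function and variable names.

```python
import math


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count used for bias correction."""
    # Running averages of the gradient (1st moment) and its square (2nd moment)
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    # Bias correction: both averages start at zero, so early values are too small
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    # Per-parameter step: 1st moment divided by the root of the 2nd moment
    theta = [p - lr * mh / (math.sqrt(vh) + eps)
             for p, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v


# Two parameters whose gradients differ by four orders of magnitude
theta, m, v = [1.0, 1.0], [0.0, 0.0], [0.0, 0.0]
theta, m, v = adam_step(theta, grad=[100.0, 0.01], m=m, v=v, t=1)
print(theta)
```

On the first step the bias-corrected ratio is close to ±1 for every parameter, so both coordinates move by roughly the learning rate even though their gradients differ enormously. That is the per-direction rescaling described above.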
Three practical truths: (1) batch size trades gradient noise for hardware efficiency — mini-batches are the standard. (2) On non-convex losses you converge to a LOCAL minimum (or saddle/plateau), not necessarily the global one — but in high dimensions good local minima are plentiful. (3) The learning rate usually is not constant: warmup then decay (cosine, step) is standard for stable, fast training.
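The warmup-then-decay pattern in point (3) can be sketched as a function of the step index. Linear warmup followed by cosine decay is one common choice; the step counts, the peak and minimum rates, and the function name are illustrative assumptions.

```python
import math


def lr_at(step, total_steps=1000, warmup_steps=100, peak_lr=1e-3, min_lr=1e-5):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp up linearly so the earliest updates stay small
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)  # clamp once training runs past total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


for step in (0, 50, 100, 550, 1000):
    print(step, f"{lr_at(step):.6f}")
```

In a training loop, the optimizer's learning rate would be set to `lr_at(step)` before each update.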
Concrete example
On an elongated, valley-shaped loss surface, plain SGD bounces across the valley walls and crawls along the floor. Momentum builds speed along the floor; Adam adapts the step per-dimension. Both typically reach the minimum in far fewer steps. That is a large part of why modern training defaults to Adam/AdamW or SGD with momentum rather than vanilla SGD.
Key equations
- Gradient descent: θ ← θ − η·∇L(θ) (η = learning rate)
- Momentum: v ← β·v − η·∇L(θ); θ ← θ + v (accumulates velocity)
- Adam: per-parameter adaptive steps from 1st & 2nd moment estimates of the gradient
- Too-large η → diverge/oscillate; too-small η → crawl
Step by step
- Start at some point on the loss surface (the contour map).
- Compute the gradient there; step downhill (opposite the gradient).
- The loss curve drops with each step.
- Watch SGD zig-zag on the elongated valley vs Momentum/Adam cutting straight to the min.
Interview questions & answers
What does the learning rate control, and what if it’s wrong?
Step size. Too large → overshoot, oscillate, or diverge; too small → very slow convergence and stalling in flat regions. It is usually the most important hyperparameter to tune.
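A small experiment makes the three regimes visible. It assumes the toy loss L(θ) = θ², whose gradient is 2θ, so each update multiplies θ by (1 − 2η); the specific learning rates are chosen for illustration.

```python
def distance_after(lr, steps=20, theta=1.0):
    """Run gradient descent on L(theta) = theta^2; return |theta| at the end."""
    for _ in range(steps):
        theta -= lr * 2.0 * theta  # gradient of theta^2 is 2 * theta
    return abs(theta)


too_small = distance_after(0.001)  # factor 0.998 per step: barely moves
good = distance_after(0.4)         # factor 0.2 per step: converges quickly
too_large = distance_after(1.1)    # factor -1.2 per step: oscillates and grows
print(f"too small: {too_small:.3f}, good: {good:.1e}, too large: {too_large:.1f}")
```

On this loss any η above 1 makes the factor's magnitude exceed 1, so the iterate flips sign and moves farther from the minimum at every step.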
Why does Momentum help?
It accumulates a velocity from past gradients, damping oscillations across a ravine and accelerating along consistent directions. The built-up velocity also helps it cross plateaus, so it usually converges faster.
What does Adam do differently?
It keeps running estimates of the gradient’s mean and of its squared magnitude (an uncentered variance) and adapts the step size per parameter: large steps where gradients are small/consistent, small where they’re noisy/large. Robust default for deep nets.
Batch vs stochastic vs mini-batch GD?
Batch uses the full dataset per step (stable gradient, but each step is slow). SGD uses one example per step (noisy but cheap; the noise can help escape shallow local minima and saddle points). Mini-batch (the standard) balances both and uses hardware well.
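The three variants differ only in how many examples feed each gradient estimate, which a single `batch_size` argument can express. The sketch below assumes a one-weight linear model on noiseless data y = 2x; the dataset, the learning rate, and the function names are illustrative assumptions.

```python
import random


def batch_gradient(w, batch):
    """Gradient of the mean of 0.5 * (w*x - y)^2 over a batch of (x, y) pairs."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


def train(data, batch_size, lr=0.05, epochs=20, seed=0):
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        shuffled = data[:]
        rng.shuffle(shuffled)  # new example order every epoch
        for i in range(0, len(shuffled), batch_size):
            w -= lr * batch_gradient(w, shuffled[i:i + batch_size])
    return w


# Assumed toy dataset: y = 2x for x = 0.1, 0.2, ..., 2.0
data = [(k / 10.0, 2.0 * k / 10.0) for k in range(1, 21)]

w_batch = train(data, batch_size=len(data))  # batch GD: 1 update per epoch
w_mini = train(data, batch_size=4)           # mini-batch: 5 updates per epoch
w_sgd = train(data, batch_size=1)            # SGD: 20 updates per epoch
print(w_batch, w_mini, w_sgd)
```

With the same number of epochs, smaller batches perform more updates and end closer to the true weight of 2 on this toy problem. On real, noisy data those extra updates come with noisier gradient estimates, which is the trade-off mini-batches balance.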
Common pitfalls
- Learning rate too high → divergence; too low → stalls. Tune it first.
- Forgetting a learning-rate schedule/warmup for large models.
- Assuming convergence to a global min — it’s a local min on non-convex losses.
Where it shows up
- Training every neural network (SGD, Momentum, Adam/AdamW)
- Optimizers in PyTorch / TensorFlow / JAX