🧠 Activation Functions — AI / ML Interview Guide

Neural Foundations · interactive visualization + interview prep

What it is

Activation functions add NON-LINEARITY between layers. Without them, stacking linear layers just collapses into one linear layer — the network could only draw straight boundaries. Non-linear activations let a network bend space and learn complex functions.

Mental model

Picture each neuron's signal passing through a shaped valve. The valve's BEND is the only source of a network's non-linear expressive power: straighten every valve (make them linear) and the whole deep net collapses to a single linear map. The STEEPNESS of the valve at the current signal is what gradients ride back through during backprop: where the valve is flat (saturated), almost no gradient passes and that neuron stops learning. So you want a valve that bends (to learn) but doesn't go flat (to keep learning).

Theory

An activation is a scalar non-linear function f applied element-wise after each layer's affine map. Its job is to break linearity: without it, a stack of layers W_L(…W₁x) multiplies out to a single matrix (and any biases fold into one offset), so depth buys nothing. The non-linearity is what lets composed layers represent curved, hierarchical functions.
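
The collapse argument can be checked numerically. The sketch below is a minimal pure-Python illustration, assuming two small made-up weight matrices W1 and W2 and an input x (none of these values come from a real model). It compares applying the layers in sequence against applying the single precomputed matrix W2·W1, then shows that inserting a ReLU between the layers breaks the equivalence.

```python
# Illustrative values only: W1, W2 and x are hypothetical, chosen to be easy to verify by hand.

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu_vec(v):
    """Apply ReLU element-wise."""
    return [max(0.0, vi) for vi in v]

W1 = [[1.0, -2.0], [0.5, 1.0]]
W2 = [[2.0, 1.0], [-1.0, 3.0]]
x = [1.0, 2.0]

stacked = matvec(W2, matvec(W1, x))              # two linear layers in sequence
collapsed = matvec(matmul(W2, W1), x)            # one layer with weights W2·W1
with_relu = matvec(W2, relu_vec(matvec(W1, x)))  # non-linearity in between

print(stacked, collapsed, with_relu)
```

With no activation, `stacked` and `collapsed` agree; with the ReLU in between, no single matrix reproduces the mapping for all inputs.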

During backprop, the gradient flowing into a layer is multiplied by f′(z), the activation's DERIVATIVE at that neuron's pre-activation. This makes the shape of f′ as important as f. If f saturates (flattens) for large |z|, then f′→0 and the gradient is throttled — chain that across many layers and you get the vanishing-gradient problem.
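
To get a feel for how quickly the f′(z) factors compound, here is a rough sketch that multiplies the sigmoid derivative across a hypothetical 10-layer chain. It makes two simplifying assumptions that are not in the text above: every neuron sits at the same pre-activation z, and the weight matrices are ignored so that only the activation's contribution is visible.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: s * (1 - s), which peaks at 0.25 when z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

def chained_factor(grad_fn, z, depth):
    """Product of `depth` copies of f'(z): the activation's share of the backprop chain."""
    return grad_fn(z) ** depth

# Assumption: same z at every layer, weights ignored.
best_case = chained_factor(sigmoid_grad, 0.0, 10)   # every neuron at the steepest point
saturated = chained_factor(sigmoid_grad, 5.0, 10)   # every neuron deep in saturation

print(best_case, saturated)
```

Even in the best case the activation factors alone shrink the gradient by 0.25 per layer; in the saturated case the product is many orders of magnitude smaller still.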

Sigmoid and tanh saturate at both ends (f′≤0.25 for sigmoid), so deep stacks of them train glacially. ReLU, max(0,z), fixes this for positive inputs — its derivative is exactly 1 there, so gradients flow undiminished, and it is cheap and induces sparsity. Its weakness is "dying ReLU": a neuron pushed permanently negative outputs 0 with 0 gradient and never recovers.
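
A dead ReLU neuron can be illustrated with a single hypothetical unit. The weight w, bias b, input values, and the upstream gradient of 1.0 below are all assumptions made up for this sketch; the point is only that when every pre-activation is negative, both the output and the weight gradient are zero for every input.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # Derivative is 1 for z > 0 and 0 for z < 0; 0 is used at z == 0 by convention.
    return 1.0 if z > 0 else 0.0

# Hypothetical neuron z = w * x + b with a large negative bias.
w, b = 0.5, -10.0
inputs = [0.0, 1.0, 2.0, 5.0]

pre_acts = [w * x + b for x in inputs]
outputs = [relu(z) for z in pre_acts]

# dL/dw = upstream * f'(z) * x, with an assumed upstream gradient of 1.0.
weight_grads = [1.0 * relu_grad(z) * x for z, x in zip(pre_acts, inputs)]

print(pre_acts, outputs, weight_grads)
```

Because every gradient is zero, a gradient-descent update leaves w and b unchanged, so the neuron stays dead on these inputs.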

The fixes trade simplicity for robustness. Leaky/Parametric ReLU give negatives a small slope so they stay alive. GELU (and SiLU/Swish) are smooth, near-ReLU curves that are differentiable everywhere and often optimize better in practice; GELU is the default in many transformers, including GPT and BERT.
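
The ReLU variants mentioned here are short enough to write out. This is a minimal scalar sketch, not any library's implementation: the slope alpha=0.01 is a commonly used default assumed for illustration, `gelu` uses the exact form x·Φ(x) via the error function, and `gelu_tanh` is the widely used tanh approximation of it.

```python
import math

def leaky_relu(z, alpha=0.01):
    """Identity for positive z, small slope alpha for negative z."""
    return z if z > 0 else alpha * z

def leaky_relu_grad(z, alpha=0.01):
    """Gradient never drops to zero, so the neuron cannot die outright."""
    return 1.0 if z > 0 else alpha

def gelu(z):
    """Exact GELU: z * Phi(z), where Phi is the standard normal CDF."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def gelu_tanh(z):
    """Tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)
    return 0.5 * z * (1.0 + math.tanh(inner))

for z in [-3.0, -1.0, 0.0, 1.0, 3.0]:
    print(z, leaky_relu(z), round(gelu(z), 4), round(gelu_tanh(z), 4))
```

Unlike ReLU, GELU is slightly negative for moderately negative inputs and only approaches zero as z becomes very negative.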

Output activations are a separate choice driven by the task, not the hidden-layer reasoning: sigmoid for a single (0,1) probability, softmax for a categorical distribution, and linear (no activation) for unbounded regression targets.
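
The three output choices can be sketched side by side. The logits and the regression value below are made-up numbers for illustration. The softmax subtracts the maximum logit before exponentiating, a standard numerical-stability trick that does not change the result.

```python
import math

def sigmoid(z):
    """Maps one logit to a probability in (0, 1): binary classification."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    """Maps a list of logits to a categorical distribution that sums to 1."""
    m = max(logits)                            # shift for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def identity(z):
    """Linear output (no activation): unbounded regression targets."""
    return z

binary_prob = sigmoid(0.0)
class_probs = softmax([2.0, 1.0, 0.1])
regression = identity(-37.5)

print(binary_prob, class_probs, regression)
```

The max shift matters in practice: without it, `math.exp` overflows for large logits such as 1000.0.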

Concrete example

ReLU (max(0, x)) is the workhorse: cheap and trains well, but “dead” neurons stuck at 0 for negative inputs motivated Leaky ReLU, and smooth variants such as GELU followed. Transformers (GPT/BERT) use GELU, a smooth ReLU-like curve, because the smoothness tends to help optimization.

Key equations

ReLU(x) = max(0, x): sparse, fast, can “die” for x < 0
Leaky ReLU(x) = max(αx, x), with 0 < α < 1: small slope keeps negatives alive
Sigmoid(x) = 1/(1+e⁻ˣ) ∈ (0,1): saturates → vanishing gradients
Tanh(x) = (eˣ − e⁻ˣ)/(eˣ + e⁻ˣ) ∈ (−1,1): zero-centered but still saturates
GELU(x) = x·Φ(x), where Φ is the standard normal CDF: smooth, default in many transformers
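
Since backprop multiplies by the derivative rather than the function itself, it may help to have the matching derivatives next to these definitions. They follow from standard calculus; φ denotes the standard normal density (the derivative of Φ), a symbol introduced here. ReLU and Leaky ReLU are not differentiable at x = 0, where frameworks simply pick a value by convention.

```latex
\begin{aligned}
\mathrm{ReLU}'(x) &= \begin{cases} 1 & x > 0 \\ 0 & x < 0 \end{cases} \\
\mathrm{LeakyReLU}'(x) &= \begin{cases} 1 & x > 0 \\ \alpha & x < 0 \end{cases} \\
\sigma'(x) &= \sigma(x)\,\bigl(1 - \sigma(x)\bigr) \le \tfrac{1}{4} \\
\tanh'(x) &= 1 - \tanh^{2}(x) \le 1 \\
\mathrm{GELU}'(x) &= \Phi(x) + x\,\varphi(x)
\end{aligned}
```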

Step by step

Each curve maps a pre-activation z to an output a = f(z).
The dashed curve is the DERIVATIVE — what backprop multiplies by.
Where the derivative is ~0 (saturation), gradients vanish and learning stalls.
ReLU and Leaky ReLU derivatives stay at 1 for z>0 → gradients flow; that’s why they train well.
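
The derivative curve can be approximated without any calculus using a central difference. The sketch below is one rough way to tabulate it, with the step size h and the sample points chosen arbitrarily for illustration; z = 0 is skipped because ReLU has a kink there.

```python
import math

def numeric_derivative(f, z, h=1e-5):
    """Central-difference estimate of f'(z)."""
    return (f(z + h) - f(z - h)) / (2.0 * h)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

activations = {"sigmoid": sigmoid, "tanh": math.tanh, "relu": relu}
points = [-6.0, -1.0, 1.0, 6.0]   # illustrative sample points, avoiding z = 0

# Estimated slope of each activation at each sample point.
table = {
    name: [numeric_derivative(f, z) for z in points]
    for name, f in activations.items()
}

for name, slopes in table.items():
    print(name, [round(s, 4) for s in slopes])
```

At |z| = 6 the sigmoid and tanh slopes are close to zero (saturation), while the ReLU slope is still 1 for every positive z.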

Interview questions & answers

Why do we need a non-linear activation at all?

Composing linear layers is still linear — the whole network reduces to one matrix. Non-linearity is what lets a deep net approximate complex, non-linear functions.

What is the vanishing-gradient problem and which activations cause it?

Sigmoid/tanh saturate for large |x|, so their derivative → 0; multiplied across many layers during backprop, gradients shrink toward zero and the earliest layers (those farthest from the loss) barely learn. ReLU-family activations avoid this for positive inputs (derivative 1).

Why ReLU over sigmoid in hidden layers?

Cheap to compute, non-saturating for x>0 (gradients flow), and induces sparsity. Sigmoid is mostly reserved for (0,1) outputs like binary probabilities.

What is the “dying ReLU” problem and a fix?

A neuron stuck outputting 0 for all inputs (always negative pre-activation) gets zero gradient and never recovers. Leaky/Parametric ReLU fix it by giving negative inputs a small non-zero slope; smooth activations such as GELU also pass some gradient for moderately negative inputs.

Common pitfalls

Sigmoid/tanh in deep hidden layers → vanishing gradients.
Plain ReLU with a high learning rate → many dead neurons.
Forgetting that the output activation must match the task (softmax/sigmoid/linear).

Where it shows up

Every neural net’s hidden layers
GELU in transformers; ReLU in CNNs/MLPs; sigmoid/softmax at outputs
