🧠 Mixture of Experts (MoE) — AI / ML Interview Guide

What it is

Instead of every token passing through one giant feed-forward network, a Mixture of Experts has many smaller expert networks and a ROUTER that sends each token to only the top-k experts (e.g. 2 of 8). Most experts stay idle per token, so you get a huge total parameter count but only pay compute for the few that fire.

Mental model

A hospital triage desk. Instead of every patient seeing every doctor, a router sends each one to just the 2 specialists they actually need. The hospital has huge total expertise (many doctors = many parameters), but each patient consumes only a sliver of it (k active experts). That is the whole trick: capacity scales with the number of doctors, but cost scales only with how many each patient sees.

Theory

A Mixture of Experts replaces a single large feed-forward network with N smaller expert networks plus a ROUTER. For each token the router picks only the top-k experts (e.g. 2 of 8) to run; the rest stay idle. This is CONDITIONAL COMPUTATION — the network that runs depends on the input.

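The payoff is decoupling capacity from compute. TOTAL parameters = all N experts plus the layers every token shares (attention, embeddings, the router); this is the model's knowledge/capacity, and its memory footprint. ACTIVE parameters per token ≈ the k experts that actually fire plus those shared layers; this is the compute / FLOPs. So MoE aims for quality close to a large model at per-token inference compute close to a small one: Mixtral 8×7B has ~47B total but only ~13B active per token. The shared layers are also why 8×7B comes to ~47B rather than 56B.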
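The total-versus-active split is easiest to see as arithmetic. The sketch below counts parameters for a Mixtral-8×7B-style model. The layer sizes are assumptions taken from the publicly released model configuration, not from this guide, and small terms such as normalization weights are ignored. count_params is a hypothetical helper written for this example.

```python
# Assumed Mixtral-8x7B-style sizes (from the public model config; treat as assumptions).
HIDDEN = 4096      # model width
FFN = 14336        # inner width of each expert's feed-forward network
LAYERS = 32
N_EXPERTS = 8
TOP_K = 2
VOCAB = 32000
KV_DIM = 1024      # grouped-query attention: 8 key/value heads x 128 dims


def count_params(experts_used: int) -> int:
    """Parameters involved when `experts_used` experts run in every layer."""
    expert = 3 * HIDDEN * FFN                               # gate, up and down projections
    attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM   # q, o plus k, v
    router = HIDDEN * N_EXPERTS                             # one score per expert
    per_layer = experts_used * expert + attention + router
    embeddings = 2 * VOCAB * HIDDEN                         # input embedding + output head
    return LAYERS * per_layer + embeddings                  # norm weights ignored


total = count_params(N_EXPERTS)   # everything you must store
active = count_params(TOP_K)      # what one token actually touches

print(f"total  ~ {total / 1e9:.1f}B")    # total  ~ 46.7B
print(f"active ~ {active / 1e9:.1f}B")   # active ~ 12.9B
```

Under these assumptions the experts hold about 45B of the 46.7B parameters, which is why switching from 8 experts to 2 per token removes most of the compute.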

The router is a small learned gate: g(x) = softmax(Wᵣ·x) produces a weight per expert; you select the top-k and combine their outputs weighted by the gate values, output = Σ over top-k of gᵢ·Expertᵢ(x). Routing is discrete (a selection), so no gradient flows through the choice itself; the router learns only through the gate weights of the experts it picked. That is part of what makes MoE training trickier than a dense model.
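Here is the gate for a single token as a minimal, dependency-free sketch. The router matrix and input are made-up toy values, and route is a hypothetical helper; a real model would do this with a tensor library over a whole batch.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def route(x, w_r, k):
    """Return (expert_index, gate_weight) pairs for the k highest-scoring experts."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in w_r]  # W_r . x
    gates = softmax(logits)                                         # one weight per expert
    top = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[:k]
    return [(i, gates[i]) for i in top]


# Toy router for N = 4 experts and a 2-dimensional token (illustrative values).
w_r = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
x = [2.0, 1.0]

chosen = route(x, w_r, k=2)
for i, g in chosen:
    print(f"expert {i}: gate {g:.3f}")
```

Because the softmax runs over all N experts, the k selected weights sum to less than 1. Some implementations renormalize them over the selected experts; this sketch follows the formula as written above and does not.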

The central training challenge is LOAD BALANCING. Left alone, the router tends to collapse onto a few favorite experts, leaving the rest undertrained and wasting capacity. An auxiliary load-balancing loss pushes the router toward even expert usage so all experts learn and contribute.
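One common form of the auxiliary loss is the one used in Switch Transformer: N · Σ fᵢ · Pᵢ, where fᵢ is the fraction of tokens routed to expert i and Pᵢ is the mean gate probability for expert i. The sketch below is one way to compute it, not the only one. Dividing by k to handle top-k routing is an assumption of this example (Switch itself uses k = 1), and load_balance_loss is a hypothetical name.

```python
def load_balance_loss(gate_probs, k=1):
    """Switch-style balance loss: N * sum_i f_i * P_i.

    gate_probs: one softmax distribution over the N experts per token.
    Equals 1.0 when usage is perfectly even and grows as routing collapses.
    """
    n_tokens = len(gate_probs)
    n_experts = len(gate_probs[0])

    counts = [0] * n_experts
    for probs in gate_probs:
        top = sorted(range(n_experts), key=lambda i: probs[i], reverse=True)[:k]
        for i in top:
            counts[i] += 1

    # f_i: share of routing slots that went to expert i (assumed top-k generalization).
    f = [c / (n_tokens * k) for c in counts]
    # P_i: average gate probability assigned to expert i.
    p = [sum(probs[i] for probs in gate_probs) / n_tokens for i in range(n_experts)]
    return n_experts * sum(fi * pi for fi, pi in zip(f, p))


# Four tokens, four experts: each token prefers a different expert.
balanced = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
# Router collapse: every token prefers expert 0.
collapsed = [[0.7, 0.1, 0.1, 0.1]] * 4

loss_balanced = load_balance_loss(balanced)     # about 1.0
loss_collapsed = load_balance_loss(collapsed)   # about 2.8
```

In training this term is scaled by a small coefficient and added to the main loss. The counts in fᵢ are not differentiable, so the gradient reaches the router through Pᵢ.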

The downsides are real: you must STORE all N experts even though only k run, so memory is high; routing adds complexity; and serving needs expert parallelism (experts spread across devices) plus the balancing machinery. MoE pays off when you want more knowledge without more per-token FLOPs; it is not a free upgrade.
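A quick back-of-the-envelope check makes the memory cost concrete. It uses the rounded ~47B total and ~13B active figures quoted above and assumes 16-bit weights. It also assumes the usual rule of thumb of about 2 FLOPs per active weight per token for a forward pass. Both helpers are hypothetical, and the results are rough estimates that ignore activations and the KV cache.

```python
def weight_memory_gb(params, bytes_per_param=2):
    """Memory needed just to hold the weights, in GB (2 bytes = 16-bit weights)."""
    return params * bytes_per_param / 1e9


def forward_flops_per_token(active_params):
    """Rule of thumb: one multiply and one add per active weight."""
    return 2 * active_params


TOTAL = 47e9    # all experts + shared layers: must be stored
ACTIVE = 13e9   # what a single token runs through

moe_memory = weight_memory_gb(TOTAL)            # memory follows TOTAL
dense_13b_memory = weight_memory_gb(ACTIVE)     # a dense model with the same compute
moe_flops = forward_flops_per_token(ACTIVE)     # compute follows ACTIVE

print(f"MoE weights:        {moe_memory:.0f} GB")        # 94 GB
print(f"dense 13B weights:  {dense_13b_memory:.0f} GB")  # 26 GB
print(f"MoE FLOPs/token:    {moe_flops:.1e}")
```

So under these assumptions the model computes like a ~13B dense model but needs the weight memory of a ~47B one, which is why experts are usually spread across devices.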

Concrete example

Mixtral 8×7B has 8 experts but activates 2 per token — ~47B total parameters yet only ~13B active per token. You get big-model quality at small-model inference cost.

Key equations

router g(x) = softmax(Wᵣ · x) → a weight per expert
select top-k experts by gate weight (sparse: k ≪ N)
output = Σ_{i∈topk} gᵢ · Expertᵢ(x)
total params ≈ N experts + shared layers; ACTIVE params per token ≈ k experts + shared layers
capacity ↑ without compute ↑ (conditional computation)

Step by step

A token arrives at the MoE layer.
The router scores all N experts (a gating distribution).
Only the top-k experts are selected (the rest stay dark).
Their outputs are combined, weighted by the gates.
Different tokens can be routed to different experts: sparse, conditional compute.
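The steps above can be sketched as a toy layer. Everything here is illustrative: ToyMoELayer and scale_by are hypothetical names, the experts are simple scaling functions standing in for feed-forward networks, and the loop handles one token at a time where a real implementation would batch tokens per expert.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


class ToyMoELayer:
    def __init__(self, router_weights, experts, k):
        self.w_r = router_weights          # N rows, one per expert
        self.experts = experts             # N callables: vector -> vector
        self.k = k
        self.calls = [0] * len(experts)    # how often each expert actually ran

    def forward_token(self, x):
        # The router scores all N experts.
        logits = [sum(w * xi for w, xi in zip(row, x)) for row in self.w_r]
        gates = softmax(logits)
        # Only the top-k experts are selected; the rest are never called.
        top = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[: self.k]
        # Their outputs are combined, weighted by the gates.
        out = [0.0] * len(x)
        for i in top:
            self.calls[i] += 1
            y = self.experts[i](x)
            out = [o + gates[i] * yi for o, yi in zip(out, y)]
        return out, top


def scale_by(s):
    """Stand-in expert: multiplies the token vector by a constant."""
    return lambda v: [s * vi for vi in v]


layer = ToyMoELayer(
    router_weights=[[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]],
    experts=[scale_by(s) for s in (1.0, 2.0, 3.0, 4.0)],
    k=2,
)

tokens = [[2.0, 1.0], [-2.0, -1.0], [1.0, -3.0]]
routes = [layer.forward_token(t)[1] for t in tokens]

print(routes)        # [[0, 1], [2, 3], [3, 0]]
print(layer.calls)   # [2, 1, 1, 2]
```

Three tokens trigger 6 expert calls (3 × k) instead of the 12 a run-every-expert layer would make, and each token lands on a different pair of experts.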

Interview questions & answers

Why use MoE instead of a bigger dense model?

Conditional computation: MoE adds parameters (capacity/knowledge) without adding much per-token compute, since only k of N experts run. You get large-model quality at a fraction of the FLOPs.

What is the load-balancing problem?

The router can collapse to always picking the same few experts, leaving others untrained. An auxiliary load-balancing loss encourages even expert usage so capacity isn’t wasted.

What are MoE’s downsides?

High memory (all experts must be stored/loaded even if rarely used), routing complexity, and harder training/serving (expert parallelism, balancing) than a dense model.

Dense vs active vs total parameters?

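Dense = every parameter runs for every token, so total and active are the same number. In MoE they split: Total = all experts’ params plus shared layers (memory). Active = the k experts that actually run per token plus shared layers (compute). MoE’s pitch is large total, small active.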

Common pitfalls

Ignoring load balancing → a few experts dominate, the rest are dead weight.
Forgetting memory cost — you store ALL experts even though only k fire.
Assuming more experts always helps — routing + balancing get harder.

Where it shows up

Mixtral, Switch Transformer, GLaM, DeepSeekMoE; GPT-4 is widely reported, though not officially confirmed, to be an MoE
Scaling parameters without scaling inference FLOPs
