🧠 Mixture of Experts (MoE) — AI / ML Interview Guide
Production & Training · interactive visualization + interview prep
What it is
Instead of every token passing through one giant feed-forward network, a Mixture of Experts has many smaller expert networks and a ROUTER that sends each token to only the top-k experts (e.g. 2 of 8). Most experts stay idle per token, so you get a huge total parameter count but only pay compute for the few that fire.
Mental model
A hospital triage desk. Instead of every patient seeing every doctor, a router sends each one to just the 2 specialists they actually need. The hospital has huge total expertise (many doctors = many parameters), but each patient consumes only a sliver of it (k active experts). That is the whole trick: capacity scales with the number of doctors, but cost scales only with how many each patient sees.
Theory
A Mixture of Experts replaces a single large feed-forward network with N smaller expert networks plus a ROUTER. For each token the router picks only the top-k experts (e.g. 2 of 8) to run; the rest stay idle. This is CONDITIONAL COMPUTATION — the network that runs depends on the input.
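The payoff is decoupling capacity from compute. TOTAL parameters count all N experts in every MoE layer plus the shared, non-expert weights such as attention and embeddings; this sets the model's knowledge/capacity and its memory footprint. ACTIVE parameters per token count only the k experts that actually fire plus those shared weights; this sets the compute (FLOPs). So MoE gives quality closer to a large dense model at roughly the per-token compute of a much smaller one: Mixtral 8×7B has ~47B total parameters but only ~13B active per token. (The total is below 8 × 7B = 56B because only the feed-forward experts are replicated; the attention layers are shared.)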
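The router is a small learned gate: g(x) = softmax(W_r·x) produces a weight per expert; you select the top-k and combine their outputs weighted by the gate values, output = Σ over top-k of gᵢ·Expertᵢ(x). Some implementations renormalize the k selected gate values so they sum to 1. Routing is discrete (a selection): the top-k choice itself is not differentiable, so gradients reach the router only through the gate values of the selected experts, which is part of what makes MoE training trickier than a dense model.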
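As a rough illustration of the gate described above, here is a minimal sketch in plain Python (no libraries). The function names `softmax` and `route`, the row-per-expert layout of `W_r`, the toy weights, and the optional `renormalize` flag are all assumptions made for this example; the formula in the text uses the raw softmax weights, which is the default here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(x, W_r, k=2, renormalize=False):
    """Return (expert_indices, gate_weights) for the top-k experts of token x.

    W_r holds one row of router weights per expert (hypothetical layout).
    """
    # One logit per expert: dot product of that expert's row with the token.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in W_r]
    gates = softmax(logits)  # g(x) = softmax(W_r · x)
    # Indices of the k largest gate values, largest first.
    top = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[:k]
    weights = [gates[i] for i in top]
    if renormalize:  # optional variant, not part of the formula in the text
        s = sum(weights)
        weights = [w / s for w in weights]
    return top, weights

W_r = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 experts, 2-dim tokens (toy values)
idx, w = route([1.0, 2.0], W_r, k=2)
print(idx, [round(v, 3) for v in w])  # [2, 1] [0.665, 0.245]
```

The router logits here are 1, 2 and 3, so experts 2 and 1 win. With `renormalize=True` the two returned weights would instead sum to 1.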
The central training challenge is LOAD BALANCING. Left alone, the router tends to collapse onto a few favorite experts, leaving the rest undertrained and wasting capacity. An auxiliary load-balancing loss pushes the router toward even expert usage so all experts learn and contribute.
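The text does not say which auxiliary loss is meant. One common choice, sketched here as an assumption, follows the form popularized by Switch Transformer: N · Σᵢ fᵢ · Pᵢ, where fᵢ is the fraction of routing slots sent to expert i and Pᵢ is the mean router probability for expert i. The function name and input layout are hypothetical. In training this value would be scaled by a small coefficient and added to the main loss.

```python
def load_balance_loss(router_probs, assignments, n_experts):
    """Switch-style auxiliary loss (one possible formulation).

    router_probs: per-token list of softmax probabilities over all experts.
    assignments:  per-token list of the expert indices chosen for that token.
    """
    n_tokens = len(router_probs)
    # f[i]: fraction of all routing slots that went to expert i.
    counts = [0] * n_experts
    for chosen in assignments:
        for i in chosen:
            counts[i] += 1
    slots = sum(counts)
    f = [c / slots for c in counts]
    # p[i]: router probability for expert i, averaged over tokens.
    p = [sum(tok[i] for tok in router_probs) / n_tokens for i in range(n_experts)]
    return n_experts * sum(fi * pi for fi, pi in zip(f, p))

# Perfectly balanced routing over 4 experts gives the minimum value, 1.0.
balanced = load_balance_loss([[0.25] * 4] * 4, [[0], [1], [2], [3]], 4)
# Full collapse onto expert 0 gives the maximum value, N = 4.0.
collapsed = load_balance_loss([[1.0, 0.0, 0.0, 0.0]] * 4, [[0]] * 4, 4)
print(balanced, collapsed)  # 1.0 4.0
```

In this formulation the counts fᵢ are not differentiable, so the gradient that nudges the router toward even usage flows through the probabilities Pᵢ.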
The downsides are real: you must STORE all N experts even though only k run, so memory is high; routing adds complexity; and serving needs expert parallelism (experts spread across devices) plus the balancing machinery. MoE wins when you want more knowledge without more per-token FLOPs — not as a free upgrade.
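To make the memory-versus-compute trade-off concrete, the following back-of-the-envelope sketch reuses the Mixtral figures quoted earlier (~47B total, ~13B active). The 16-bit weight size and the "about 2 FLOPs per active parameter per token" rule of thumb for a forward pass are assumptions, and activation and KV-cache memory are ignored.

```python
total_params = 47e9    # from the text: Mixtral 8x7B, all experts
active_params = 13e9   # from the text: parameters used per token
bytes_per_param = 2    # assumption: 16-bit weights

# Every expert must be resident, so memory follows TOTAL parameters.
weight_memory_gb = total_params * bytes_per_param / 1e9

# Assumption: a forward pass costs roughly 2 FLOPs per active parameter.
moe_flops_per_token = 2 * active_params
dense_flops_per_token = 2 * total_params  # hypothetical dense model of equal size
compute_ratio = dense_flops_per_token / moe_flops_per_token

print(f"weights: {weight_memory_gb:.0f} GB")                 # weights: 94 GB
print(f"dense / MoE FLOPs per token: {compute_ratio:.1f}x")  # 3.6x
```

Under these assumptions the model needs the memory of a ~47B model while doing the per-token arithmetic of a ~13B one, which is why serving usually spreads experts across devices.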
Concrete example
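Mixtral 8×7B has 8 experts per MoE layer but activates only 2 per token: ~47B total parameters, yet only ~13B active per token. You get quality closer to a large dense model at roughly the per-token compute of a 13B one, while still needing memory for all ~47B parameters.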
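The ~47B and ~13B figures can be approximately reproduced with simple arithmetic. The dimensions below are assumptions chosen to resemble Mixtral's publicly reported configuration (they do not come from the text above), and small terms such as normalization weights are ignored, so treat this as an illustrative sketch rather than an exact accounting.

```python
# All dimensions are assumptions for this sketch.
d_model = 4096                               # hidden size
d_ff = 14336                                 # expert feed-forward inner size
n_layers = 32
n_experts = 8                                # N
top_k = 2                                    # k
n_heads, n_kv_heads, head_dim = 32, 8, 128   # grouped-query attention
vocab = 32000

# One expert: a gated feed-forward block with three weight matrices.
expert = 3 * d_model * d_ff
router = d_model * n_experts
# Attention: Q and output projections use all heads; K and V use fewer heads.
attention = 2 * d_model * n_heads * head_dim + 2 * d_model * n_kv_heads * head_dim
embeddings = 2 * vocab * d_model             # input embedding + output projection

# Shared weights run for every token; experts are the only sparse part.
shared = n_layers * (attention + router) + embeddings
total_params = shared + n_layers * n_experts * expert
active_params = shared + n_layers * top_k * expert

print(f"total  ≈ {total_params / 1e9:.1f}B")   # total  ≈ 46.7B
print(f"active ≈ {active_params / 1e9:.1f}B")  # active ≈ 12.9B
```

The gap between 46.7B and the naive 8 × 7B = 56B comes from the shared attention and embedding weights, which are counted once rather than once per expert.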
Key equations
- router: g(x) = softmax(Wᵣ · x) → a weight per expert
- select the top-k experts by gate weight (sparse: k ≪ N)
- output = Σ_{i∈topk} gᵢ · Expertᵢ(x)
- total params = N experts; ACTIVE params per token ≈ k experts
- capacity ↑ without compute ↑ (conditional computation)
Step by step
- A token arrives at the MoE layer.
- The router scores all N experts (a gating distribution).
- Only the top-k experts are selected (the rest stay dark).
- Their outputs are combined, weighted by the gates.
- Each token uses a different few experts — sparse, conditional compute.
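The five steps above can be traced end to end in a toy layer. This is a minimal sketch in plain Python: the name `moe_layer`, the router weights, and the "experts" (each just scales its input) are hypothetical stand-ins for real feed-forward networks, and the raw softmax gate values are used as combination weights.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(x, W_r, experts, k=2):
    """Run one token through a toy MoE layer; return (output, experts_used)."""
    # Step 2: the router scores all N experts.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in W_r]
    gates = softmax(logits)
    # Step 3: only the top-k experts are selected; the rest never run.
    top = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[:k]
    # Step 4: combine the selected experts' outputs, weighted by their gates.
    out = [0.0] * len(x)
    for i in top:
        y = experts[i](x)
        out = [o + gates[i] * yi for o, yi in zip(out, y)]
    return out, top

# Toy experts: expert i multiplies the token by (i + 1).
experts = [lambda x, s=i + 1: [s * xi for xi in x] for i in range(4)]
W_r = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]  # 4 experts, 2-dim tokens

# Steps 1 and 5: different tokens arrive and end up on different experts.
for token in ([2.0, 0.5], [-0.5, -2.0]):
    out, used = moe_layer(token, W_r, experts)
    print(token, "-> experts", used)
# [2.0, 0.5] -> experts [0, 1]
# [-0.5, -2.0] -> experts [3, 2]
```

Production implementations typically group tokens by expert and process each group as a batch instead of looping token by token; the loop here is only for readability.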
Interview questions & answers
Why use MoE instead of a bigger dense model?
Conditional computation: MoE adds parameters (capacity/knowledge) without adding much per-token compute, since only k of N experts run. You get large-model quality at a fraction of the FLOPs.
What is the load-balancing problem?
The router can collapse to always picking the same few experts, leaving others untrained. An auxiliary load-balancing loss encourages even expert usage so capacity isn’t wasted.
What are MoE’s downsides?
High memory (all experts must be stored/loaded even if rarely used), routing complexity, and harder training/serving (expert parallelism, balancing) than a dense model.
Dense vs active vs total parameters?
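Dense = every parameter is used for every token, so total and active are the same number. In an MoE, total = all experts’ params plus the shared weights (memory), while active = the shared weights plus the k experts that actually run per token (compute). MoE’s pitch is large total, small active.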
Common pitfalls
- Ignoring load balancing → a few experts dominate, the rest are dead weight.
- Forgetting memory cost — you store ALL experts even though only k fire.
- Assuming more experts always helps — routing + balancing get harder.
Where it shows up
- Mixtral, Switch Transformer, GLaM, DeepSeek-MoE; GPT-4-class models are widely reported, but not officially confirmed, to use MoE
- Scaling parameters without scaling inference FLOPs