🧠 Agent Memory — AI / ML Interview Guide
Agentic Systems · interactive visualization + interview prep
What it is
An agent’s working memory is its context window, which has a fixed size. As a conversation grows, old turns must leave the window. Short-term memory keeps the recent turns; when it overflows, older turns are SUMMARIZED (or dropped). Important facts are written to LONG-TERM memory (a vector store) so they can be retrieved later, even after they have left the window.
Mental model
The context window is your DESK — only so much fits. Recent turns are the papers spread on the desk (short-term memory); when it overflows you compress old notes into a summary and FILE durable facts into a cabinet (long-term vector store), pulling them back out by relevance when needed. "Agent memory" is really desk management plus a filing system — not a bigger brain.
Theory
An agent's working memory is its context window, which has a FIXED token budget. As a conversation grows, older turns must leave the window. So "memory" in agents is the engineering around this constraint: deciding what to keep verbatim, what to compress, what to persist elsewhere, and what to pull back in.
Short-term memory is the recent turns held in the window. When it overflows, the oldest turns are SUMMARIZED (compaction) into a compact summary that preserves key information while freeing budget — or simply evicted. This trades detail for room.
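The overflow-then-compact behavior can be sketched as below. This is a minimal illustration, not a reference implementation: `count_tokens` (a whitespace word count) and `summarize` (truncating each turn to its first four words) are hypothetical stand-ins for a real tokenizer and an LLM summarization call, and the class name `ShortTermMemory` is invented for this example.

```python
from collections import deque


def count_tokens(text):
    # Assumption: whitespace word count stands in for a real tokenizer.
    return len(text.split())


def summarize(turns):
    # Hypothetical stand-in for an LLM call: keep the first 4 words of each turn.
    return " | ".join(" ".join(t.split()[:4]) for t in turns)


class ShortTermMemory:
    def __init__(self, budget):
        self.budget = budget      # max tokens allowed for verbatim turns
        self.turns = deque()      # recent turns, oldest on the left
        self.summary = ""         # running summary of evicted turns

    def used(self):
        return sum(count_tokens(t) for t in self.turns)

    def add(self, turn):
        self.turns.append(turn)
        evicted = []
        # Evict oldest turns until we fit; always keep the newest turn.
        while self.used() > self.budget and len(self.turns) > 1:
            evicted.append(self.turns.popleft())
        if evicted:
            # Fold the evicted turns into the running summary.
            new_part = summarize(evicted)
            self.summary = " | ".join(p for p in (self.summary, new_part) if p)


mem = ShortTermMemory(budget=8)
for turn in ["my dog is named Rex", "what is the weather today", "thanks a lot"]:
    mem.add(turn)
print("summary:", mem.summary)
print("window:", list(mem.turns))
```

In this sketch the running summary only grows; a real system would likely re-compress the summary itself once it becomes large.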
Long-term memory persists salient facts OUTSIDE the window, typically by embedding them into a vector store. On each turn the agent retrieves the most relevant stored facts by semantic similarity and injects them back into context — so a fact mentioned 200 turns ago can resurface even though that turn scrolled out long ago (this reuses the Embeddings + RAG machinery).
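A toy version of the write-then-retrieve cycle is sketched below. It assumes a bag-of-words `embed` function in place of a learned embedding model and a plain Python list in place of a vector database; `LongTermMemory`, `write`, and `retrieve` are hypothetical names chosen for this example.

```python
import math
import re
from collections import Counter


def embed(text):
    # Assumption: word counts stand in for a learned embedding vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class LongTermMemory:
    def __init__(self):
        self.items = []  # list of (fact text, embedding) pairs

    def write(self, fact):
        self.items.append((fact, embed(fact)))

    def retrieve(self, query, k=2):
        # Rank stored facts by similarity to the query and return the top k.
        q = embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(q, it[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


store = LongTermMemory()
store.write("user's dog is named Rex")
store.write("user prefers dark mode")
store.write("user lives in Lisbon")
print(store.retrieve("what is my dog's name?", k=1))
```

With real embeddings the match would be semantic rather than lexical, so a query like "what's my pet called?" could also surface the dog fact; the word-count stand-in here only matches on shared words.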
The prompt assembled each turn is therefore a composition: system instructions + a running summary of old turns + retrieved long-term facts + the recent verbatim turns. Managing that budget is the heart of agent memory.
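One way to picture this composition is a function that places the fixed parts first and then spends whatever budget remains on the most recent turns. The sketch below assumes a whitespace word count as the token counter and invents the name `build_context` and the `Summary:` / `Fact:` prefixes for illustration.

```python
def count_tokens(text):
    # Assumption: whitespace word count stands in for a real tokenizer.
    return len(text.split())


def build_context(system, summary, facts, recent_turns, budget):
    # Fixed parts go in first: system instructions, summary, retrieved facts.
    parts = [system]
    if summary:
        parts.append("Summary: " + summary)
    parts.extend("Fact: " + f for f in facts)

    # Spend the remaining budget on recent turns, newest first.
    remaining = budget - sum(count_tokens(p) for p in parts)
    kept = []
    for turn in reversed(recent_turns):
        cost = count_tokens(turn)
        if cost > remaining:
            break
        kept.append(turn)
        remaining -= cost

    # Restore chronological order for the turns that fit.
    return "\n".join(parts + kept[::-1])


prompt = build_context(
    system="You are helpful.",
    summary="user has a dog",
    facts=["dog is named Rex"],
    recent_turns=["hello there friend", "what is my dog's name?"],
    budget=20,
)
print(prompt)
```

Filling from the newest turn backwards is a design choice in this sketch: it guarantees the latest user message survives even when the budget is tight.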
Why not just use a bigger context window? It delays the problem but does not solve it: cost and latency grow with context length, models tend to get "lost in the middle" (facts buried mid-context are recalled less reliably than those near the start or end), and conversation history can grow without bound while any window is finite. The failure modes are also instructive: no long-term store means forgetting anything that scrolls out; over-aggressive summarization loses needed detail; and retrieving irrelevant memories distracts or contradicts the model.
Concrete example
You tell the bot "my dog is named Rex" early on. Many turns later the window has scrolled past that turn — but because "dog = Rex" was saved to long-term memory, asking "what’s my dog’s name?" still retrieves Rex. Without long-term memory, the agent would have forgotten.
Key equations
- context window: a fixed token budget for recent turns
- overflow → summarize oldest turns into a compact summary (or evict them)
- long-term memory: embed & store facts in a vector DB
- on each turn: retrieve relevant long-term facts → add to context
- context = system + summary + retrieved facts + recent turns
Step by step
- New turns enter the context window (short-term memory).
- When the window is full, the oldest turns are summarized and evicted.
- Salient facts are written to long-term memory as you go.
- A later question retrieves the relevant fact from long-term memory…
- …even though that turn already scrolled out of the window.
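The steps above can be traced end to end in a few lines. This sketch makes several simplifying assumptions: the window is measured in turns rather than tokens, old turns are evicted without being summarized, saliency is supplied as a flag on each turn instead of being detected, and retrieval uses word overlap rather than embeddings. The function name `run_conversation` is hypothetical.

```python
import re
from collections import deque


def words(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def run_conversation(turns, question, window_size=3):
    window = deque(maxlen=window_size)  # oldest turn drops out automatically
    long_term = []                      # durable facts, kept outside the window
    for text, salient in turns:
        window.append(text)             # new turns enter the window
        if salient:
            long_term.append(text)      # salient facts are persisted as we go
    # A later question pulls back the stored fact with the most shared words.
    q = words(question)
    recalled = max(long_term, key=lambda fact: len(q & words(fact)), default=None)
    return list(window), recalled


turns = [
    ("my dog is named Rex", True),
    ("tell me a joke", False),
    ("another one please", False),
    ("how is the weather", False),
    ("thanks", False),
]
window, recalled = run_conversation(turns, "what's my dog's name?")
print("window:", window)
print("recalled:", recalled)
```

The first turn is no longer in `window` by the time the question arrives, yet it is still returned from `long_term`.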
Interview questions & answers
Why isn’t a bigger context window enough?
It delays but doesn’t solve the problem: cost/latency grow with context, models get “lost in the middle” of very long contexts, and history is unbounded. You still need summarization + retrieval to scale and stay relevant.
Short-term vs long-term memory?
Short-term = the recent turns held in the context window (volatile, bounded). Long-term = facts persisted outside the window (e.g., a vector store) and retrieved on demand.
How do you decide what to remember long-term?
Use heuristics or an LLM to extract durable facts (names, preferences, decisions) and skip chit-chat; store them with embeddings so they’re retrievable by semantic similarity.
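As a rough illustration of the heuristic route, the sketch below pulls simple key/value facts out of a turn with two regular expressions and returns nothing for chit-chat. The patterns and the name `extract_facts` are assumptions made for this example; a production system would more likely prompt an LLM to do the extraction.

```python
import re

# Hypothetical patterns for durable facts; real systems would need many more.
PATTERNS = [
    # "my dog is named Rex" -> ("dog", "Rex")
    re.compile(r"\bmy (?P<key>[\w ]+?) is (?:named |called )?(?P<value>[\w ]+)", re.I),
    # "I prefer dark mode" -> ("prefer", "dark mode")
    re.compile(r"\bi (?P<key>prefer|like|use) (?P<value>[\w ]+)", re.I),
]


def extract_facts(turn):
    facts = []
    for pattern in PATTERNS:
        for match in pattern.finditer(turn):
            key = match.group("key").strip().lower()
            value = match.group("value").strip()
            facts.append((key, value))
    return facts


for turn in ["My dog is named Rex.", "I prefer dark mode", "lol ok sounds good"]:
    print(turn, "->", extract_facts(turn))
```

Each extracted pair would then be embedded and written to the long-term store, while turns that yield no facts are left to scroll out of the window.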
What is summarization (a.k.a. compaction) doing?
Compressing many old turns into a short summary that preserves key information while freeing context budget — trading detail for room.
Common pitfalls
- No long-term store → the agent forgets anything that scrolls out of the window.
- Summarizing too aggressively → losing details you later need.
- Retrieving irrelevant memories → distracting/contradicting the model.
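The last pitfall is often mitigated by filtering retrieved memories before they reach the prompt. A minimal sketch, assuming the retriever already returns `(text, similarity)` pairs and that a cutoff of 0.3 is a reasonable starting point (both the function name `select_memories` and the threshold value are illustrative and would need tuning):

```python
def select_memories(scored, k=3, min_score=0.3):
    # scored: list of (text, similarity) pairs from any retriever.
    # Drop weak matches first, then keep at most k of the strongest.
    relevant = [(text, score) for text, score in scored if score >= min_score]
    relevant.sort(key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in relevant[:k]]


candidates = [("dog is named Rex", 0.82), ("user likes jazz", 0.05)]
print(select_memories(candidates))
```

Returning an empty list when nothing clears the threshold is intentional here: injecting no memory is usually safer than injecting an unrelated one.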
Where it shows up
- Conversational assistants with persistent memory
- LangChain/LlamaIndex memory + vector stores
- Context compaction in long agent runs