🧠 Agent Memory — AI / ML Interview Guide

Agentic Systems · interactive visualization + interview prep


What it is

An agent’s working memory is its context window, which has a fixed size. As a conversation grows, old turns must leave the window. Short-term memory keeps the recent turns; when it overflows, older turns are SUMMARIZED (or dropped), and important facts are written to LONG-TERM memory (a vector store) so they can be retrieved later, even after they’ve left the window.

Mental model

The context window is your DESK — only so much fits. Recent turns are the papers spread on the desk (short-term memory); when it overflows you compress old notes into a summary and FILE durable facts into a cabinet (long-term vector store), pulling them back out by relevance when needed. "Agent memory" is really desk management plus a filing system — not a bigger brain.

Theory

An agent's working memory is its context window, which has a FIXED token budget. As a conversation grows, older turns must leave the window. So "memory" in agents is the engineering around this constraint: deciding what to keep verbatim, what to compress, what to persist elsewhere, and what to pull back in.

Short-term memory is the recent turns held in the window. When it overflows, the oldest turns are SUMMARIZED (compaction) into a short summary that preserves key information while freeing budget, or they are simply evicted. Either way, this trades detail for room.
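A minimal sketch of this overflow behavior, under loud assumptions: count_tokens is a crude word counter standing in for a real tokenizer, summarize is a placeholder for an LLM call, and the class name ShortTermMemory and the budget of 12 are hypothetical. The summary's own token cost is not charged against the budget here.

```python
from collections import deque

def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())

def summarize(old_summary: str, evicted: list[str]) -> str:
    """Placeholder for an LLM summarization call: keeps the first 4 words of each turn."""
    clipped = [" ".join(t.split()[:4]) for t in evicted]
    return " | ".join(p for p in [old_summary, *clipped] if p)

class ShortTermMemory:
    def __init__(self, budget: int):
        self.budget = budget      # token budget for verbatim turns
        self.turns = deque()      # recent turns, oldest on the left
        self.summary = ""         # running summary of evicted turns

    def add(self, turn: str) -> None:
        self.turns.append(turn)
        evicted = []
        # Evict oldest turns until the verbatim turns fit again; always keep the newest turn.
        while len(self.turns) > 1 and sum(count_tokens(t) for t in self.turns) > self.budget:
            evicted.append(self.turns.popleft())
        if evicted:
            self.summary = summarize(self.summary, evicted)

memory = ShortTermMemory(budget=12)
for turn in ["user: my dog is named Rex",
             "bot: nice to meet Rex",
             "user: what is the weather like today"]:
    memory.add(turn)

print(memory.summary)
print(list(memory.turns))
```

In this run the first turn is evicted, and its clipped summary no longer contains the name Rex. That is the detail loss that compaction risks and that long-term memory is meant to cover.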

Long-term memory persists salient facts OUTSIDE the window, typically by embedding them into a vector store. On each turn the agent retrieves the most relevant stored facts by semantic similarity and injects them back into context — so a fact mentioned 200 turns ago can resurface even though that turn scrolled out long ago (this reuses the Embeddings + RAG machinery).
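The store-then-retrieve loop can be sketched as follows. Everything here is illustrative: embed is a toy bag-of-words counter rather than a real embedding model, LongTermMemory is a hypothetical in-memory class rather than any particular vector database, and ranking is plain cosine similarity.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class LongTermMemory:
    def __init__(self):
        self.items = []  # list of (embedding, fact) pairs

    def store(self, fact: str) -> None:
        self.items.append((embed(fact), fact))

    def retrieve(self, query: str, k: int = 1) -> list[str]:
        """Return the k stored facts most similar to the query."""
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[0]), reverse=True)
        return [fact for _, fact in ranked[:k]]

store = LongTermMemory()
store.store("the user's dog is named Rex")
store.store("the user prefers metric units")
store.store("the user lives in Lisbon")

top = store.retrieve("what is my dog's name?", k=1)
print(top)
```

Word overlap cannot tell that "name" and "named" are related, so this toy only works when the query shares literal words with the fact. Closing that gap is the reason a real embedding model is used in practice.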

The prompt assembled each turn is therefore a composition: system instructions + a running summary of old turns + retrieved long-term facts + the recent verbatim turns. Managing that budget is the heart of agent memory.
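One possible way to assemble that composition under a budget is sketched below. The priority order (system, summary, facts, then as many of the newest turns as fit), the function name assemble_prompt, the "Summary:" and "Fact:" labels, and the word-count tokenizer are all assumptions of this sketch, not a prescribed layout.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())

def assemble_prompt(system, summary, facts, recent_turns, budget):
    """Fill the budget in priority order: system, summary, facts, then newest turns first."""
    fixed = [system]
    if summary:
        fixed.append(f"Summary: {summary}")
    fixed += [f"Fact: {f}" for f in facts]
    remaining = budget - sum(count_tokens(p) for p in fixed)

    kept = []
    for turn in reversed(recent_turns):   # walk from newest to oldest
        cost = count_tokens(turn)
        if cost > remaining:
            break                         # older turns no longer fit
        kept.append(turn)
        remaining -= cost
    return "\n".join(fixed + kept[::-1])  # restore chronological order

prompt = assemble_prompt(
    system="You are a helpful assistant.",
    summary="User introduced their dog.",
    facts=["dog = Rex"],
    recent_turns=[
        "user: tell me a long story about dogs and cats please",
        "bot: sure, here it is",
        "user: what is my dog's name?",
    ],
    budget=25,
)
print(prompt)
```

The oldest verbatim turn is the first thing dropped when space runs out, while the summary and retrieved facts keep their place in the prompt.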

Why not just use a bigger context window? It delays the problem but does not solve it: cost and latency grow with context, models get "lost in the middle" of very long inputs, and conversation history can grow without bound, so any fixed window eventually overflows. The failure modes are also instructive: no long-term store means forgetting anything that scrolls out; over-aggressive summarization loses needed detail; and retrieving irrelevant memories distracts or contradicts the model.

Concrete example

You tell the bot "my dog is named Rex" early on. Many turns later the window has scrolled past that turn — but because "dog = Rex" was saved to long-term memory, asking "what’s my dog’s name?" still retrieves Rex. Without long-term memory, the agent would have forgotten.

Key equations

context window: a fixed token budget for recent turns
overflow → summarize oldest turns into a compact summary (or evict them)
long-term memory: embed & store facts in a vector DB
on each turn: retrieve relevant long-term facts → add to context
context = system + summary + retrieved facts + recent turns
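
As a rough formalization of the last line, using symbols that are assumptions of this sketch rather than notation from the guide: let B be the token budget available for the prompt and T_x the token count of component x. The composition then has to satisfy:

```latex
T_{\text{system}} + T_{\text{summary}} + T_{\text{retrieved}} + T_{\text{recent}} \le B
```

When a new turn pushes the left side past B, compaction shrinks T_recent at the cost of a smaller increase in T_summary, and eviction shrinks T_recent with no increase at all.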

Step by step

New turns enter the context window (short-term memory).
When the window is full, the oldest turns are summarized and evicted.
Salient facts are written to long-term memory as you go.
A later question retrieves the relevant fact from long-term memory, even though that turn already scrolled out of the window.
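
The steps above can be walked through in a compact simulation. This is a sketch under simplifying assumptions: the window is counted in turns rather than tokens, overflow evicts without summarizing, is_salient is a hypothetical rule that treats statements starting with "my " as durable facts, and recall ranks by shared words instead of embeddings.

```python
import re
from collections import deque

WINDOW_TURNS = 3  # hypothetical window size, counted in turns for simplicity

window = deque(maxlen=WINDOW_TURNS)  # short-term memory: the oldest turn falls out automatically
long_term = []                       # long-term memory: saved fact strings

def words(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def is_salient(turn):
    """Hypothetical heuristic: statements starting with 'my ' are durable facts."""
    return turn.lower().startswith("my ")

def observe(turn):
    window.append(turn)         # new turn enters the window
    if is_salient(turn):
        long_term.append(turn)  # salient fact is written to long-term memory

def recall(question):
    """Return the stored fact sharing the most words with the question, or None."""
    return max(long_term, key=lambda f: len(words(f) & words(question)), default=None)

for turn in ["my dog is named Rex", "what's the weather?", "tell me a joke", "another joke please"]:
    observe(turn)

in_window = any("Rex" in t for t in window)  # has the Rex turn scrolled out?
answer = recall("what's my dog's name?")
print(in_window, answer)
```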

Interview questions & answers

Why isn’t a bigger context window enough?

It delays but doesn’t solve the problem: cost/latency grow with context, models get “lost in the middle” of very long contexts, and conversation history can grow without bound, so any fixed window eventually overflows. You still need summarization + retrieval to scale and stay relevant.

Short-term vs long-term memory?

Short-term = the recent turns held in the context window (volatile, bounded). Long-term = facts persisted outside the window (e.g., a vector store) and retrieved on demand.

How do you decide what to remember long-term?

Use heuristics or an LLM to extract durable facts (names, preferences, decisions) and skip chit-chat, then store those facts with embeddings so they’re retrievable by semantic similarity.
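
A minimal sketch of the heuristic route, assuming a few hand-written regular expressions. DURABLE_PATTERNS and extract_facts are hypothetical names, and the patterns are illustrative only; an LLM-based extractor would replace them in a more robust system.

```python
import re

# Hypothetical patterns for durable facts; chit-chat matches none of them.
DURABLE_PATTERNS = [
    re.compile(r"\bmy (\w+) is (?:named |called )?(\w+)", re.I),  # names and attributes
    re.compile(r"\bi (?:prefer|like|love|hate) (.+)", re.I),       # preferences
    re.compile(r"\b(?:let's|we will|i decided to) (.+)", re.I),    # decisions
]

def extract_facts(turn: str) -> list[str]:
    """Return the matched spans of a turn that look like durable facts."""
    return [m.group(0).rstrip(".!?") for p in DURABLE_PATTERNS if (m := p.search(turn))]

turns = ["Hi there!", "My dog is named Rex.", "lol ok", "I prefer metric units."]
facts = [f for t in turns for f in extract_facts(t)]
print(facts)
```

Each extracted string would then be embedded and written to the long-term store, while turns that match nothing are simply left to scroll out.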

What is summarization (a.k.a. compaction) doing?

It compresses many old turns into a short summary that preserves key information while freeing context budget, trading detail for room.

Common pitfalls

No long-term store → the agent forgets anything that scrolls out of the window.
Summarizing too aggressively → losing details you later need.
Retrieving irrelevant memories → distracting/contradicting the model.
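
One common guard against the last pitfall is a minimum similarity score, so that retrieval returns nothing when no stored memory is close enough. The sketch below assumes Jaccard word overlap as the similarity measure and a hypothetical threshold of 0.2; a real system would apply the same idea to embedding similarity and tune the threshold on its own data.

```python
import re

def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Size of the intersection divided by size of the union of two word sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def retrieve(query, facts, k=2, min_score=0.2):
    """Return up to k facts, keeping only those that score at least min_score."""
    scored = sorted(((jaccard(tokens(query), tokens(f)), f) for f in facts), reverse=True)
    return [f for score, f in scored[:k] if score >= min_score]

facts = ["dog is named Rex", "prefers metric units"]
relevant = retrieve("what is the dog named", facts)
unrelated = retrieve("tell me a joke", facts)
print(relevant, unrelated)
```

Without the threshold, a plain top-k lookup would inject both facts into the joke request even though neither has anything to do with it.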

Where it shows up

Conversational assistants with persistent memory
LangChain/LlamaIndex memory + vector stores
Context compaction in long agent runs
