🧠 Embeddings & Cosine Similarity — AI / ML Interview Guide

What it is

An embedding turns a word (or sentence, image, row…) into a vector — a point in space — so that things with similar meaning land in similar directions. "Closeness" is measured by the ANGLE between vectors (cosine similarity), not their raw distance, so length/frequency doesn’t distort meaning.

Mental model

Meaning becomes geometry. Every piece of text is placed as an arrow in a high-dimensional space, and "related" means "pointing the same way". To search, you embed the query as one more arrow and ask "whose arrow points most like mine?" — DIRECTION (cosine), not length, is the answer. The whole of semantic search, RAG, recommendation, and clustering is variations on placing things in this space and measuring angles.

Theory

An embedding is a learned map from a discrete object (word, sentence, image, user) to a dense vector in ℝⁿ, trained so that semantically similar inputs land nearby. The principle behind text embeddings is the distributional hypothesis: words that appear in similar contexts have similar meanings, so a model trained to predict context (word2vec) or to encode sentences (a transformer) ends up arranging meaning geometrically. Modern embeddings are CONTEXTUAL — the same word gets different vectors in different sentences — unlike static word2vec.
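
The distributional hypothesis can be illustrated without any training. The sketch below is an assumption-laden toy: the six-sentence corpus is made up, and raw co-occurrence counts stand in for learned embeddings. It only shows that words sharing contexts end up with similar vectors.

```python
import math
from collections import Counter, defaultdict

# Toy corpus (hypothetical): "cat" and "dog" share contexts, "car" does not.
corpus = [
    "cat eats fish",
    "dog eats meat",
    "cat sleeps inside",
    "dog sleeps outside",
    "car needs fuel",
    "car needs repair",
]

# For each word, count the other words that appear in the same sentence.
cooc = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for w in words:
        for ctx in words:
            if ctx != w:
                cooc[w][ctx] += 1


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)


sim_cat_dog = cosine(cooc["cat"], cooc["dog"])  # shared contexts: eats, sleeps
sim_cat_car = cosine(cooc["cat"], cooc["car"])  # no shared contexts
print(sim_cat_dog, sim_cat_car)
```

Here "cat" and "dog" score higher than "cat" and "car" purely because they co-occur with the same words. A trained model such as word2vec learns dense vectors with the same tendency rather than counting directly.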

Similarity is measured by cosine: the cosine of the angle between two vectors, (a·b)/(‖a‖‖b‖), ranging from +1 (same direction) through 0 (orthogonal, unrelated) to −1 (opposite). Cosine is magnitude-INVARIANT, which is why it beats Euclidean distance for text: a long, common-word vector and a short, rare-word vector with the same meaning still score high, because only their direction matters.
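
A minimal sketch of the formula above in plain Python. The example vectors are hand-picked assumptions, chosen so the three landmark values (+1, 0, −1) are easy to check by hand.

```python
import math


def cosine_similarity(a, b):
    """(a . b) / (||a|| * ||b||) for two equal-length sequences of numbers."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


a = [1.0, 2.0, 3.0]
same_direction = [10.0, 20.0, 30.0]  # a scaled by 10: same direction, longer
orthogonal = [3.0, 0.0, -1.0]        # dot product with a is 3 + 0 - 3 = 0
opposite = [-1.0, -2.0, -3.0]        # a flipped

print(cosine_similarity(a, same_direction))  # approximately +1
print(cosine_similarity(a, orthogonal))      # approximately 0
print(cosine_similarity(a, opposite))        # approximately -1

# Euclidean distance, by contrast, sees the scaled copy as far away.
print(math.dist(a, same_direction))
```

The scaled copy is about 33.7 units away in Euclidean terms yet has cosine similarity of essentially 1, which is the magnitude invariance described above.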

In practice embeddings are usually L2-normalized to unit length. After normalization cosine similarity equals a plain dot product, which is faster to compute at scale, and magnitude can no longer skew the ranking. This is why many vector search setups store normalized vectors and rank by dot product.
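
The equivalence is easy to verify numerically. This is a small sketch with made-up 2-D vectors; the helper names (`l2_normalize`, `dot`) are illustrative, not taken from any particular library.

```python
import math


def l2_normalize(v):
    """Scale v to unit length (assumes v is not the zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]

a_hat = l2_normalize(a)  # normalize once, e.g. when the vector is stored
b_hat = l2_normalize(b)

print(cosine_similarity(a, b))  # cosine on the raw vectors
print(dot(a_hat, b_hat))        # plain dot product on the unit vectors
```

Both lines print the same value up to floating-point rounding. The design point is that the two square roots and the division are paid once per vector at indexing time instead of once per comparison at query time.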

High dimensionality brings the "curse of dimensionality": as n grows, raw distances concentrate (everything looks roughly equidistant), making naive distance noisy and brute-force search over every vector expensive. This motivates ranking by direction (cosine on normalized vectors) plus approximate-nearest-neighbor indexes like HNSW, which can search millions of vectors in milliseconds (see the Vector Search concept).
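
Distance concentration can be seen in a quick experiment. The sketch below assumes random Gaussian points, which is a simplification: real embeddings are structured, not random, so the effect there is less extreme. `relative_contrast` is a name made up for this example.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query to random points.

    A large value means near and far neighbors are clearly separated;
    a value close to 0 means everything is roughly equidistant.
    """
    rng = random.Random(seed)
    query = [rng.gauss(0, 1) for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.gauss(0, 1) for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


low_dim = relative_contrast(2)
high_dim = relative_contrast(1000)
print(low_dim, high_dim)
```

In 2 dimensions the farthest point is many times farther than the nearest one; in 1000 dimensions the gap shrinks to a small fraction of the nearest distance, which is the "everything looks equidistant" effect.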

A critical constraint: vectors only live in the same space if they came from the SAME model, so embeddings produced by different models cannot be compared directly. Separately, a generic embedding model may collapse domain jargon together, which is why specialized or fine-tuned embedding models matter for niche corpora.
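
Why cross-model comparison fails can be sketched with an idealized stand-in. Assume a hypothetical "model B" that encodes exactly the same geometry as "model A" but with its axes shuffled and some signs flipped. Real models differ far more than this, so the example is a best case.

```python
import math
import random


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


rng = random.Random(0)
dim = 64

# Hypothetical "model A" embeddings: a document and a closely related query.
doc_a = [rng.gauss(0, 1) for _ in range(dim)]
query_a = [x + 0.1 * rng.gauss(0, 1) for x in doc_a]

# Hypothetical "model B": same geometry, different coordinate system.
perm = list(range(dim))
rng.shuffle(perm)
signs = [rng.choice((-1.0, 1.0)) for _ in range(dim)]


def to_model_b(v):
    return [signs[i] * v[perm[i]] for i in range(dim)]


doc_b = to_model_b(doc_a)
query_b = to_model_b(query_a)

within_a = cosine(query_a, doc_a)  # both vectors from model A
within_b = cosine(query_b, doc_b)  # both vectors from model B
across = cosine(query_a, doc_b)    # query from A, document from B
print(within_a, within_b, across)
```

Within either space the query and document stay highly similar, but the mixed comparison produces an essentially arbitrary score even though both "models" agree perfectly on meaning. Re-embedding the whole corpus is therefore needed when the embedding model changes.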

Concrete example

In a typical word-embedding space, "king" and "queen" point in nearly the same direction (cosine ≈ 0.8; the exact value depends on the model), while "king" and "dog" are close to orthogonal (cosine ≈ 0), meaning unrelated rather than opposite. This is exactly how semantic search / RAG finds relevant documents: embed the query, then return the chunks whose vectors have the highest cosine similarity to it.

Key equations
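
The relations stated in the Theory section, written out. The notation is an assumption for this block: a, b ∈ ℝⁿ are two embeddings, θ is the angle between them, and hats denote unit-length vectors. The last line is the standard expansion of the squared distance between unit vectors, which is what makes cosine and Euclidean rankings agree after normalization.

```latex
% Cosine similarity
\cos\theta
  = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
  = \frac{\sum_{i=1}^{n} a_i b_i}
         {\sqrt{\sum_{i=1}^{n} a_i^2} \, \sqrt{\sum_{i=1}^{n} b_i^2}},
\qquad -1 \le \cos\theta \le 1

% L2 normalization turns cosine into a plain dot product
\hat{a} = \frac{a}{\lVert a \rVert}, \quad
\hat{b} = \frac{b}{\lVert b \rVert}
\quad\Longrightarrow\quad
\cos\theta = \hat{a} \cdot \hat{b}

% Euclidean distance between unit vectors is a monotone function of cosine
\lVert \hat{a} - \hat{b} \rVert^2
  = \lVert \hat{a} \rVert^2 + \lVert \hat{b} \rVert^2 - 2\, \hat{a} \cdot \hat{b}
  = 2 - 2\cos\theta
```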

Step by step

  1. Each word is embedded as a vector (shown as a point / arrow from the origin).
  2. Pick a query word — click any point.
  3. Cosine similarity = the cosine of the angle between the query vector and each other vector.
  4. Words in the same direction (small angle) score near 1; perpendicular directions score near 0; opposite directions near −1.
  5. Sort by similarity → that ranking is what a vector search / RAG retriever returns.
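
The steps above can be sketched in a few lines. The 2-D vectors here are hand-picked for illustration and do not come from a real model; `rank_by_cosine` is a hypothetical helper name.

```python
import math

# Step 1: hypothetical 2-D embeddings (real ones have hundreds of dimensions).
embeddings = {
    "king": (0.9, 0.4),
    "queen": (0.8, 0.6),
    "prince": (1.0, 0.3),
    "dog": (-0.4, 0.9),
    "banana": (-0.5, -0.5),
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def rank_by_cosine(query_word, vectors):
    """Steps 3-5: score every other word against the query, then sort descending."""
    q = vectors[query_word]
    scores = [(w, cosine(q, v)) for w, v in vectors.items() if w != query_word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Step 2: pick a query word.
ranking = rank_by_cosine("king", embeddings)
for word, score in ranking:
    print(f"{word:>7}  {score:+.3f}")
```

With these toy vectors "prince" and "queen" land at the top, "dog" sits at zero (perpendicular to "king"), and "banana" comes last with a negative score. A retriever does the same thing with chunks instead of words.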

Interview questions & answers

Why cosine similarity instead of Euclidean distance?

Cosine measures direction (angle) and ignores magnitude, so a long, common-word vector and a short, rare-word vector with the same meaning still score high. Euclidean distance on raw vectors is sensitive to magnitude, so two vectors pointing the same way can still be far apart. For unit-normalized embeddings the two are monotonically related (‖a − b‖² = 2 − 2·cos θ) and give the same ranking, but cosine is the standard for text.
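
The monotonic relationship can be checked empirically. The sketch below uses random vectors with deliberately varied lengths as a stand-in corpus (an assumption; no real embeddings are involved) and compares three rankings.

```python
import math
import random


def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def cosine(a, b):
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


rng = random.Random(42)
dim = 8
query = [rng.gauss(0, 1) for _ in range(dim)]

# Hypothetical corpus: random directions with very different magnitudes.
corpus = []
for _ in range(50):
    scale = rng.uniform(0.1, 10.0)
    corpus.append([scale * rng.gauss(0, 1) for _ in range(dim)])

ids = range(len(corpus))
q_hat = normalize(query)

# Highest cosine first.
by_cosine = sorted(ids, key=lambda i: cosine(query, corpus[i]), reverse=True)
# Smallest Euclidean distance first, on unit-normalized vectors.
by_euclid_unit = sorted(ids, key=lambda i: math.dist(q_hat, normalize(corpus[i])))
# Smallest Euclidean distance first, on the raw vectors.
by_euclid_raw = sorted(ids, key=lambda i: math.dist(query, corpus[i]))

print(by_cosine == by_euclid_unit)
print(by_cosine == by_euclid_raw)
```

The first comparison holds because squared distance between unit vectors is 2 − 2·cos θ, a decreasing function of cosine. The raw-vector ranking will generally differ, since short vectors sit close to everything regardless of direction.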

What does an embedding actually encode?

An embedding encodes learned features, arranged so that semantically or contextually similar inputs sit close together in vector space. The features come from a trained model (word2vec, a transformer’s hidden states, or a dedicated embedding model); they are not hand-assigned.

How are embeddings used in RAG / semantic search?

Documents are chunked and embedded into a vector DB offline. At query time you embed the query and retrieve the top-k chunks by cosine similarity, then feed them to the LLM as context. Quality hinges on the embedding model + chunking.
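
A minimal end-to-end sketch of that flow. Everything here is a stand-in: `embed` is a bag-of-words counter rather than a real embedding model, the "vector DB" is a Python list, and the documents are made up. A real system would swap in a trained embedding model and an ANN index.

```python
import math
import re
from collections import Counter


def chunk(text, max_words=12):
    """Naive fixed-size chunking by word count."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Stand-in embedder: L2-normalized bag-of-words counts. NOT a real model."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {w: c / norm for w, c in counts.items()} if norm else {}


def dot(u, v):
    """Dot product of sparse unit vectors, which equals their cosine similarity."""
    return sum(x * v.get(w, 0.0) for w, x in u.items())


# Offline: chunk and embed the documents into an in-memory index.
documents = [
    "Cosine similarity compares the direction of two vectors",
    "HNSW is an approximate nearest neighbor index",
    "Bananas are rich in potassium",
]
index = [(c, embed(c)) for doc in documents for c in chunk(doc)]


def retrieve(query, k=2):
    """Query time: embed the query and return the top-k chunks by cosine."""
    q = embed(query)
    scored = sorted(index, key=lambda item: dot(q, item[1]), reverse=True)
    return [text for text, _ in scored[:k]]


query = "how do I compare two vectors by direction"
top = retrieve(query, k=1)
# The retrieved chunks become context for the LLM (the LLM call itself is omitted).
prompt = "Context:\n" + "\n".join(top) + "\n\nQuestion: " + query
print(prompt)
```

The bag-of-words stand-in only matches exact words, so it retrieves the cosine chunk through "two", "vectors", and "direction" but would miss a paraphrase. That gap is exactly what a trained embedding model is meant to close.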

Why normalize embeddings to unit length?

After L2-normalization, cosine similarity equals a plain dot product (faster), and magnitude no longer skews ranking: every vector is compared purely by direction.

What’s the "curse of dimensionality" for embeddings?

In high dimensions distances concentrate (everything looks roughly equidistant), so raw distance gets noisy; cosine + approximate-nearest-neighbor indexes (HNSW) are used to search efficiently and meaningfully.
