Interactive tutorial on how attention mechanisms work in transformers
Transformers revolutionized AI by introducing the attention mechanism. Unlike earlier models that process words sequentially, transformers can look at all words simultaneously and decide which ones are most relevant for understanding each word.
Understand how transformers use Query, Key, and Value matrices to compute attention scores, allowing the model to focus on relevant parts of the input when processing each word.
Let's work with a sentence that has clear relationships between its words: "The black cat ate fish".
Think about it: To understand "cat" fully, the model needs to know it's a "black cat" (not just any cat) that "ate fish" (not sleeping or playing). This is why attention matters - words get meaning from their context!
Earlier sequential models: process words one at a time: "The" → "black" → "cat" → "ate" → "fish". Slow, and struggles with long sequences.
Transformers: process all words at once using self-attention. Fast, and handles long sequences well.
Instead of processing words in order, transformers ask: "For each word, which other words in the sentence should I pay attention to?"
Attention Matrix (Bidirectional)
Each row shows how that word attends to all other words. Darker = stronger attention.
✓ Bidirectional Attention: Each word sees the entire sentence. Perfect for understanding context when you have the full input, as in BERT, which is used for classification and other understanding tasks.
Attention Matrix (Causal/Masked)
Each row shows how that word attends ONLY to itself and previous words. Future positions are masked (⨯).
⚡ Causal/Masked Attention: Each word can only see itself and previous words. Essential for text generation where you predict one word at a time (like GPT, Claude).
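Here is a minimal NumPy sketch of the difference, assuming a 5×5 score matrix for our five-word sentence (the scores are random placeholders, not values from a trained model):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the chosen axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Raw attention scores for "The black cat ate fish": one row per word
rng = np.random.default_rng(0)
scores = rng.normal(size=(5, 5))

# Bidirectional attention (BERT-style): every word attends to every word
bidirectional_weights = softmax(scores)

# Causal attention (GPT-style): mask future positions with -inf so softmax
# gives them zero weight
future_mask = np.triu(np.ones((5, 5), dtype=bool), k=1)   # True above the diagonal
causal_weights = softmax(np.where(future_mask, -np.inf, scores))

print(causal_weights.round(2))   # the upper triangle is all zeros
```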
This is what powers ChatGPT, Claude, Gemini and most other modern AI language models. The attention mechanism lets them understand complex relationships between words, no matter how far apart they are in a sentence!
Static word embeddings: the same vector regardless of context. They can't ask questions or provide answers.
Query, Key, and Value vectors: dynamic representations for each role. A word can query, be queried, and provide values.
Each word's embedding is transformed by three learned matrices to create these specialized representations:
WQ matrix → Query: "What am I looking for?"
WK matrix → Key: "What do I offer?"
WV matrix → Value: "What information do I contain?"
Embeddings are learned representations where each dimension captures some learned concept. The numbers are relative strengths - higher means stronger association. In real models, there are 768-12,288 dimensions!
↓ We'll use the cat embedding [0.2, 0.8, 0.1] in the transformation example below
Let's see how the word "cat" gets transformed (simplified 3D example):
WQ creates a "search pattern" by recombining embedding dimensions.
Row 1 of WQ [1.0, 0.2, 0.5] creates the first Query dimension: 1.0×0.2 + 0.2×0.8 + 0.5×0.1 = 0.41
Metaphor: WQ turns "cat" into a magnet asking "I'm looking for animal-related, noun-like words"
WK creates an "identity tag" by recombining embedding dimensions.
Row 1 of WK [0.8, 1.1, 0.2] creates the first Key dimension: 0.8×0.2 + 1.1×0.8 + 0.2×0.1 = 1.06
Metaphor: WK turns "cat" into a magnet offering "I provide noun-related, animal information"
WV creates the "content package" that will be passed forward.
Row 1 of WV [0.6, 0.9, 1.1] creates the first Value dimension: 0.6×0.2 + 0.9×0.8 + 1.1×0.1 = 0.95
Metaphor: WV packages "cat's" actual semantic content to be delivered based on attention weights
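The same arithmetic in runnable form, using only the first rows given above (the remaining rows of each matrix would fill in the other dimensions the same way):

```python
import numpy as np

x_cat = np.array([0.2, 0.8, 0.1])      # "cat" embedding

# First rows of the projection matrices, as given in the walkthrough
wq_row1 = np.array([1.0, 0.2, 0.5])
wk_row1 = np.array([0.8, 1.1, 0.2])
wv_row1 = np.array([0.6, 0.9, 1.1])

q_cat_dim1 = wq_row1 @ x_cat   # 0.2 + 0.16 + 0.05 = 0.41
k_cat_dim1 = wk_row1 @ x_cat   # 0.16 + 0.88 + 0.02 = 1.06
v_cat_dim1 = wv_row1 @ x_cat   # 0.12 + 0.72 + 0.11 = 0.95
print(q_cat_dim1, k_cat_dim1, v_cat_dim1)
```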
Now that we have Query, Key, and Value vectors for each word, we use them to calculate attention scores through dot products:

Attention(Q, K, V) = softmax(Q·Kᵀ / √dk) · V

where dk is the dimension of the key vectors; dividing by √dk is a scaling factor that keeps the dot products from growing too large.
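As a compact reference, the whole formula fits in a few lines of NumPy; this sketch assumes Q, K, and V are matrices with one row per word:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (num_words, d) matrices of Query, Key, and Value vectors
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # dot-product match, scaled by sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                    # weighted sum of Value vectors
```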
Compare "cat's" Query with all Keys via dot product:
Higher dot product = better match between what "cat" is looking for and what that word offers
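For example (only the first Query dimension, 0.41, was derived above; the rest of the Query and all the Key vectors are made-up values to show the mechanics):

```python
import numpy as np

q_cat = np.array([0.41, 0.60, 0.35])    # Query for "cat" (dims 2-3 are hypothetical)

keys = {                                 # hypothetical Key vectors for each word
    "The":   np.array([0.10, 0.05, 0.20]),
    "black": np.array([1.06, 0.70, 0.40]),
    "cat":   np.array([0.90, 0.55, 0.30]),
    "ate":   np.array([0.40, 0.60, 0.25]),
    "fish":  np.array([0.35, 0.45, 0.50]),
}

# Dot product of "cat's" Query with each word's Key
# (the sqrt(dk) scaling from the formula is left out to keep the numbers simple)
scores = {word: round(float(q_cat @ k), 2) for word, k in keys.items()}
print(scores)   # {'The': 0.14, 'black': 0.99, 'cat': 0.8, 'ate': 0.61, 'fish': 0.59}
```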
Apply softmax to convert the scores into probabilities that sum to 1:
These are the attention weights!
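In code, the softmax step looks like this, using the dot-product scores from the previous sketch:

```python
import numpy as np

words  = ["The", "black", "cat", "ate", "fish"]
scores = np.array([0.14, 0.99, 0.80, 0.61, 0.59])   # scores from the previous sketch

# Softmax: exponentiate, then normalize so the weights sum to 1
weights = np.exp(scores - scores.max())
weights = weights / weights.sum()

for word, w in zip(words, weights):
    print(f"{word:>5}: {w:.2f}")                 # "black" gets the largest weight
print("sum =", round(float(weights.sum()), 2))   # 1.0
```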
Multiply each word's Value by its attention weight and sum:
The output for "cat" is now enriched with information from "black" (its most attended word) and other context!
Why This Works: The Query-Key dot product is high when both vectors point in similar directions - meaning "cat's" query matches well with "black's" key. This automatically determines which words provide the most relevant context!
What's happening: When you adjust Q_cat values, you're changing what "cat" is "looking for". Higher values in dimensions where Key vectors also have high values = stronger attention!
Example: If you increase Q_cat[1] to match K_black[1] (both = 1.0), "cat" will pay more attention to "black" because they align in that dimension. This is how transformers learn semantic relationships!
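You can reproduce this effect with a small experiment; the Key vectors here are invented for illustration, with K_black[1] set to 1.0:

```python
import numpy as np

def attention_weights(q, keys, d_k=3):
    # Scaled dot-product scores followed by softmax
    scores = keys @ q / np.sqrt(d_k)
    e = np.exp(scores - scores.max())
    return e / e.sum()

words = ["The", "black", "cat", "ate", "fish"]
keys = np.array([              # hypothetical Key vectors
    [0.1, 0.1, 0.2],           # The
    [0.4, 1.0, 0.3],           # black  (K_black[1] = 1.0)
    [0.9, 0.6, 0.3],           # cat
    [0.4, 0.6, 0.3],           # ate
    [0.3, 0.5, 0.5],           # fish
])

q_cat = np.array([0.4, 0.2, 0.3])
print(dict(zip(words, attention_weights(q_cat, keys).round(2))))

# Raise Q_cat[1] to 1.0 so it aligns with K_black[1]:
q_cat_adjusted = np.array([0.4, 1.0, 0.3])
print(dict(zip(words, attention_weights(q_cat_adjusted, keys).round(2))))
# "black" now gets a larger share and becomes the most-attended word
```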