Interactive tutorial on how attention mechanisms work in transformers
Transformers revolutionized AI by introducing the attention mechanism. Unlike earlier models that process words sequentially, transformers can look at all words simultaneously and decide which ones are most relevant for understanding each word.
Understand how transformers use Query, Key, and Value matrices to compute attention scores, allowing the model to focus on relevant parts of the input when processing each word.
Let's work with a sentence that has clear relationships between its words: "The black cat ate fish".
Think about it: To understand "cat" fully, the model needs to know it's a "black cat" (not just any cat) that "ate fish" (not sleeping or playing). This is why attention matters - words get meaning from their context!
Earlier sequential models: process words one at a time: "The" → "black" → "cat" → "ate" → "fish". Slow, and struggles with long sequences.
Transformers: process all words at once using self-attention. Fast, and handles long sequences well.
Instead of processing words in order, transformers ask: "For each word, which other words in the sentence should I pay attention to?"
Attention Matrix (Bidirectional)
Each row shows how that word attends to all other words. Darker = stronger attention.
✓ Bidirectional Attention: Each word sees the entire sentence. Perfect for understanding context when you have the full input, as in BERT, which is used for classification and other understanding tasks.
Attention Matrix (Causal/Masked)
Each row shows how that word attends ONLY to itself and previous words. Future positions are masked (⨯).
⚡ Causal/Masked Attention: Each word can only see itself and previous words. Essential for text generation where you predict one word at a time (like GPT, Claude).
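Here is a minimal NumPy sketch of the difference, assuming a 5×5 score matrix for our five-word sentence (the scores are random placeholders, not values from a trained model):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the chosen axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Raw attention scores for "The black cat ate fish": one row per word
rng = np.random.default_rng(0)
scores = rng.normal(size=(5, 5))

# Bidirectional attention (BERT-style): every word attends to every word
bidirectional_weights = softmax(scores)

# Causal attention (GPT-style): mask future positions with -inf so softmax
# gives them zero weight
future_mask = np.triu(np.ones((5, 5), dtype=bool), k=1)   # True above the diagonal
causal_weights = softmax(np.where(future_mask, -np.inf, scores))

print(causal_weights.round(2))   # the upper triangle is all zeros
```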
This is what powers ChatGPT, Claude, Gemini and most other modern AI language models. The attention mechanism lets them understand complex relationships between words, no matter how far apart they are in a sentence!
Static word embeddings: the same vector regardless of context. They can't ask questions or provide answers.
Query, Key, and Value vectors: dynamic representations for each role. A word can query, be queried, and provide values.
Each word's embedding is transformed by three learned matrices to create these specialized representations:
WQ matrix → Query: "What am I looking for?"
WK matrix → Key: "What do I offer?"
WV matrix → Value: "What information do I contain?"
Embeddings are learned representations where each dimension captures some learned concept. The numbers are relative strengths - higher means stronger association. In real models, there are 768-12,288 dimensions!
↓ We'll use the cat embedding [0.2, 0.8, 0.1] in the transformation example below
Let's see how the word "cat" gets transformed (simplified 3D example):
WQ creates a "search pattern" by recombining embedding dimensions.
Row 1 of WQ [1.0, 0.2, 0.5] creates the first Query dimension: 1.0×0.2 + 0.2×0.8 + 0.5×0.1 = 0.41
Metaphor: WQ turns "cat" into a magnet asking "I'm looking for animal-related, noun-like words"
WK creates an "identity tag" by recombining embedding dimensions.
Row 1 of WK [0.8, 1.1, 0.2] creates the first Key dimension: 0.8×0.2 + 1.1×0.8 + 0.2×0.1 = 1.06
Metaphor: WK turns "cat" into a magnet offering "I provide noun-related, animal information"
WV creates the "content package" that will be passed forward.
Row 1 of WV [0.6, 0.9, 1.1] creates the first Value dimension: 0.6×0.2 + 0.9×0.8 + 1.1×0.1 = 0.95
Metaphor: WV packages "cat's" actual semantic content to be delivered based on attention weights
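The same arithmetic in runnable form, using only the first rows given above (the remaining rows of each matrix would fill in the other dimensions the same way):

```python
import numpy as np

x_cat = np.array([0.2, 0.8, 0.1])      # "cat" embedding

# First rows of the projection matrices, as given in the walkthrough
wq_row1 = np.array([1.0, 0.2, 0.5])
wk_row1 = np.array([0.8, 1.1, 0.2])
wv_row1 = np.array([0.6, 0.9, 1.1])

q_cat_dim1 = wq_row1 @ x_cat   # 0.2 + 0.16 + 0.05 = 0.41
k_cat_dim1 = wk_row1 @ x_cat   # 0.16 + 0.88 + 0.02 = 1.06
v_cat_dim1 = wv_row1 @ x_cat   # 0.12 + 0.72 + 0.11 = 0.95
print(q_cat_dim1, k_cat_dim1, v_cat_dim1)
```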
Now that we have Query, Key, and Value vectors for each word, we use them to calculate attention scores through dot products:

Attention(Q, K, V) = softmax(Q·Kᵀ / √dk) · V

where dk is the dimension of the key vectors; dividing by √dk is a scaling factor that keeps the dot products from growing too large.
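As a compact reference, the whole formula fits in a few lines of NumPy; this sketch assumes Q, K, and V are matrices with one row per word:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (num_words, d) matrices of Query, Key, and Value vectors
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # dot-product match, scaled by sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                    # weighted sum of Value vectors
```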
Compare "cat's" Query with all Keys via dot product:
Higher dot product = better match between what "cat" is looking for and what that word offers
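For example (only the first Query dimension, 0.41, was derived above; the rest of the Query and all the Key vectors are made-up values to show the mechanics):

```python
import numpy as np

q_cat = np.array([0.41, 0.60, 0.35])    # Query for "cat" (dims 2-3 are hypothetical)

keys = {                                 # hypothetical Key vectors for each word
    "The":   np.array([0.10, 0.05, 0.20]),
    "black": np.array([1.06, 0.70, 0.40]),
    "cat":   np.array([0.90, 0.55, 0.30]),
    "ate":   np.array([0.40, 0.60, 0.25]),
    "fish":  np.array([0.35, 0.45, 0.50]),
}

# Dot product of "cat's" Query with each word's Key
# (the sqrt(dk) scaling from the formula is left out to keep the numbers simple)
scores = {word: round(float(q_cat @ k), 2) for word, k in keys.items()}
print(scores)   # {'The': 0.14, 'black': 0.99, 'cat': 0.8, 'ate': 0.61, 'fish': 0.59}
```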
Apply softmax to convert the scores into probabilities that sum to 1:
These are the attention weights!
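In code, the softmax step looks like this, using the dot-product scores from the previous sketch:

```python
import numpy as np

words  = ["The", "black", "cat", "ate", "fish"]
scores = np.array([0.14, 0.99, 0.80, 0.61, 0.59])   # scores from the previous sketch

# Softmax: exponentiate, then normalize so the weights sum to 1
weights = np.exp(scores - scores.max())
weights = weights / weights.sum()

for word, w in zip(words, weights):
    print(f"{word:>5}: {w:.2f}")                 # "black" gets the largest weight
print("sum =", round(float(weights.sum()), 2))   # 1.0
```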
Multiply each word's Value by its attention weight and sum:
The output for "cat" is now enriched with information from "black" (its most attended word) and other context!
Why This Works: The Query-Key dot product is high when both vectors point in similar directions - meaning "cat's" query matches well with "black's" key. This automatically determines which words provide the most relevant context!
What's happening: When you adjust Q_cat values, you're changing what "cat" is "looking for". Higher values in dimensions where Key vectors also have high values = stronger attention!
Example: If you increase Q_cat[1] to match K_black[1] (both = 1.0), "cat" will pay more attention to "black" because they align in that dimension. This is how transformers learn semantic relationships!
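You can reproduce this effect with a small experiment; the Key vectors here are invented for illustration, with K_black[1] set to 1.0:

```python
import numpy as np

def attention_weights(q, keys, d_k=3):
    # Scaled dot-product scores followed by softmax
    scores = keys @ q / np.sqrt(d_k)
    e = np.exp(scores - scores.max())
    return e / e.sum()

words = ["The", "black", "cat", "ate", "fish"]
keys = np.array([              # hypothetical Key vectors
    [0.1, 0.1, 0.2],           # The
    [0.4, 1.0, 0.3],           # black  (K_black[1] = 1.0)
    [0.9, 0.6, 0.3],           # cat
    [0.4, 0.6, 0.3],           # ate
    [0.3, 0.5, 0.5],           # fish
])

q_cat = np.array([0.4, 0.2, 0.3])
print(dict(zip(words, attention_weights(q_cat, keys).round(2))))

# Raise Q_cat[1] to 1.0 so it aligns with K_black[1]:
q_cat_adjusted = np.array([0.4, 1.0, 0.3])
print(dict(zip(words, attention_weights(q_cat_adjusted, keys).round(2))))
# "black" now gets a larger share and becomes the most-attended word
```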