Matrix Multiplication in LLMs

Understanding how neural networks transform information through matrix operations

Why Matrix Multiplication Matters

Every time you use ChatGPT, Claude, or any other large language model, billions of matrix multiplications happen behind the scenes. Each one transforms the numeric representation of your input into a new one, building up layer after layer of structure until the model can generate a response.

This demo will help you understand what matrix multiplication actually does (not just how to compute it) and why it's the workhorse operation of neural networks.

1 One Token, Three Features

A helpful fiction: Real LLMs use vectors with 768 to 12,288+ dimensions, not 3. And those dimensions don't map neatly to human concepts—they're abstract patterns learned from training data. But to build intuition, let's pretend we can label them with made-up meanings.
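For a sense of the real scale, here is a minimal sketch using Hugging Face's transformers library (an assumption: that you have it installed; GPT2Config's defaults describe GPT-2 small):

```python
# Minimal sketch of a real embedding width, assuming the `transformers`
# package is installed. GPT2Config's defaults match GPT-2 small.
from transformers import GPT2Config

config = GPT2Config()
print(config.n_embd)  # 768: each token is a 768-dimensional vector
```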

Imagine a single token is represented by just 3 numbers, where each number captures some aspect of the word:

Token "running"
plural-ish
past-ish
action-ish
2
1
3
x = [2, 1, 3]

Remember: "plural-ish" and "action-ish" are fictional labels we invented! In real models, dimensions don't have neat meanings. But the math works identically—so this intuition will serve you well.
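In code, the toy vector is just three numbers. A quick numpy sketch (the labels are the made-up names from above, not anything a real model exposes):

```python
import numpy as np

# Our invented feature labels; real model dimensions have no such names.
features = ["plural-ish", "past-ish", "action-ish"]
x = np.array([2.0, 1.0, 3.0])  # the token "running" as x = [2, 1, 3]

for label, value in zip(features, x):
    print(f"{label}: {value}")
```

The same code works unchanged if you swap in 768 numbers instead of 3, which is the whole point of the vector abstraction.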