Interactive demonstration of word embeddings and context-based learning
Human Embeddings
Before we dive into complex models, let's try something simpler. Look at the plot on the right and tell me: where are "chicken" and "king" in this coordinate system?
What are their X₁ and X₂ coordinates?
Word Embedding Space
Congratulations, you just did text embeddings!
You just converted words into numbers! chicken = (3.5, 2.0) and king = (1.0, 5.0).
These coordinate pairs are called "embeddings" - they capture word meaning as numbers that computers can work with. Notice how similar words are close together in this space.
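Here's a tiny sketch of that idea in code (Python with numpy; the "hen" coordinates are assumed for illustration). Closeness in the plot becomes a number we can compute:

```python
import numpy as np

# Coordinates from the plot: (X1, X2). The "hen" position is assumed.
embeddings = {
    "chicken": np.array([3.5, 2.0]),
    "hen":     np.array([4.0, 1.5]),
    "king":    np.array([1.0, 5.0]),
}

def distance(a, b):
    """Euclidean distance between two word embeddings."""
    return float(np.linalg.norm(embeddings[a] - embeddings[b]))

print(distance("chicken", "hen"))   # small: similar words sit close together
print(distance("chicken", "king"))  # large: different meanings sit far apart
```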
Understanding Meaning Dimensions
Each coordinate axis represents a meaning dimension: a continuous feature that tells us something about the word's meaning. Representing a word with a list of such features gives us an embedding vector.
X₂ axis (power/status dimension): higher values mean more power (royalty: king, queen, monarch); lower values mean less power (animals: rooster, chicken, hen).
X₁ axis (gender dimension): values toward one end are more male (man, king, rooster); toward the other, more female (woman, queen, hen).
Why Start With Just 2 Dimensions?
Real word embeddings like those in GPT or Word2Vec use 300+ dimensions - imagine hundreds of meaning aspects like formality, emotion, concreteness, etc. We start with just 2 dimensions (power and gender) to build your intuition.
Same principles, different scale! Whether it's 2 dimensions or 300, the core concept is identical: words become vectors of numbers that capture semantic meaning.
Key insight:
Each dimension captures a different aspect of meaning. Instead of just saying "king is royal", we can say "king has a power value of 5.0 and a gender value of 1.0". This turns fuzzy concepts into precise numbers that computers can calculate with!
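Because meanings are now numbers, we can literally calculate with them. Here's a sketch using the two axes above (all coordinates except king's are assumed for illustration):

```python
import numpy as np

# (gender, power) coordinates; only king's come from the plot above
king  = np.array([1.0, 5.0])
man   = np.array([1.0, 1.0])
woman = np.array([5.0, 1.0])
queen = np.array([5.0, 5.0])

# Remove "male", add "female": king - man + woman lands on queen
result = king - man + woman
print(result)                       # [5. 5.]
print(np.allclose(result, queen))   # True
```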
What are we learning?
Now that you've seen how words can be represented as coordinates, let's explore how computers learn these representations automatically! Our teaching example is loosely based on CBOW (Continuous Bag of Words), which learns word meanings by predicting a target word from its surrounding context.
Learning Objective: CBOW Model
Given context words (like "the ___ sat on the"), the computer predicts the missing word ("cat"). By training on this task, words that appear in similar contexts develop similar numerical representations, capturing semantic relationships.
Why CBOW works:
Words that appear in similar contexts tend to have similar meanings. "cat" and "dog" both appear after "the" and before "sat", so they develop similar embeddings. This distributional hypothesis forms the foundation of modern word embeddings.
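Here's a minimal sketch of the CBOW prediction step (numpy only; the vocabulary, dimension, and initialization are toy assumptions, not the demo's actual internals): average the context embeddings, score every vocabulary word, and apply a softmax.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "dog", "sat", "on", "mat"]
word_to_id = {w: i for i, w in enumerate(vocab)}
dim = 5

# As in word2vec: one matrix for context words, one for target scores
E_in = rng.normal(0, 0.1, (len(vocab), dim))
E_out = rng.normal(0, 0.1, (len(vocab), dim))

def predict(context_words):
    """CBOW forward pass: average context embeddings, softmax over vocab."""
    h = E_in[[word_to_id[w] for w in context_words]].mean(axis=0)
    scores = E_out @ h
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

probs = predict(["the", "sat", "on", "the"])  # context of "the ___ sat on the"
print(vocab[int(probs.argmax())])             # untrained, so essentially random
```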
Our Training Dataset
Our CBOW model was pre-trained on thousands of sentences from news articles and books. The full dataset contains rich contexts for animals (cat/dog), rulers (king/queen), cities (Paris/London), and various actions. Below is a sample of the vocabulary learned from this corpus.
Why word IDs matter:
Computers can't work with words directly - they need numbers. Each word gets a unique ID (like "cat" → 2, "sat" → 3). These IDs are used to look up the word's embedding vector during training and prediction.
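In code, this lookup is just a dictionary plus a matrix row index (a sketch; the IDs for "cat" and "sat" match the example above, the rest are assumed):

```python
import numpy as np

word_to_id = {"the": 0, "mat": 1, "cat": 2, "sat": 3, "on": 4}

# One row of 5 numbers per word (random before training)
embeddings = np.random.default_rng(0).normal(0, 0.1, (len(word_to_id), 5))

# "cat" -> ID 2 -> row 2 of the embedding matrix
cat_vector = embeddings[word_to_id["cat"]]
print(word_to_id["cat"], cat_vector.shape)  # 2 (5,)
```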
Select Training Sentence
Choose which sentence the computer should learn from.
Context Window
Look at 5 words on each side of our target word.
Target Word Position
Which word are we trying to predict?
Embedding Dimensions
Each word gets 5 numbers to describe it.
1. Breaking Down the Sentence
First, we identify our context words (the clues) and our target word (what we're trying to predict). Context words are colored blue; the target word is purple.
Process explanation:
The model uses context words (blue) to predict the target word (purple) through learned associations.
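Splitting a sentence this way takes only a few lines (a sketch; the sentence and window size mirror the controls above):

```python
def make_example(tokens, target_pos, window=5):
    """Return (context_words, target_word) for one CBOW training example."""
    lo = max(0, target_pos - window)
    hi = target_pos + 1 + window
    context = tokens[lo:target_pos] + tokens[target_pos + 1:hi]
    return context, tokens[target_pos]

tokens = "the cat sat on the mat".split()
print(make_example(tokens, target_pos=1))
# (['the', 'sat', 'on', 'the', 'mat'], 'cat')
```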
2. Training the Model
Each training step improves the embeddings by adjusting them based on prediction errors. Words that appear in similar contexts will gradually develop similar vector representations.
What to observe during training:
Loss decreasing: The model is getting better at predictions
Similarity changes: Words in similar contexts become more similar (see the cosine-similarity sketch after this list)
Embedding evolution: Watch the numbers in the embedding matrix change
Prediction improvements: The model should predict the target word with higher confidence
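The similarity being tracked is usually cosine similarity between embedding rows. Here is a sketch (the vectors are made-up stand-ins for post-training values):

```python
import numpy as np

def cosine_similarity(a, b):
    """+1 = same direction, 0 = unrelated, -1 = opposite."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up embeddings for two words that share contexts
cat = np.array([0.9, 0.1, 0.8, 0.2, 0.1])
dog = np.array([0.8, 0.2, 0.9, 0.1, 0.2])
print(round(cosine_similarity(cat, dog), 3))  # near 1.0 after training
```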
Training Controls
Watch the model learn meaningful word relationships! The Learning Rate control sets how fast the model learns (default 0.5), and the Training Progress panel tracks the current step, the loss, and the training status, including whether semantic clustering has emerged.
How training works:
The model starts with random embeddings. When it makes wrong predictions, it adjusts the embeddings slightly. Over many steps, words that appear in similar contexts develop similar embeddings - this is how meaning emerges from statistics!
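Here's a sketch of one such adjustment (full-softmax gradient descent for clarity; real word2vec speeds this up with tricks like negative sampling, and everything here is a toy assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]
word_to_id = {w: i for i, w in enumerate(vocab)}
dim, lr = 5, 0.5

E_in = rng.normal(0, 0.1, (len(vocab), dim))   # random starting embeddings
E_out = rng.normal(0, 0.1, (len(vocab), dim))

def train_step(context, target):
    """Predict the target from its context, then nudge the embeddings."""
    global E_out
    ctx = [word_to_id[w] for w in context]
    t = word_to_id[target]
    h = E_in[ctx].mean(axis=0)                 # forward: average the context
    s = E_out @ h
    exp = np.exp(s - s.max())
    probs = exp / exp.sum()
    loss = -np.log(probs[t])                   # wrong prediction -> high loss
    err = probs.copy()
    err[t] -= 1.0                              # softmax + cross-entropy gradient
    grad_h = E_out.T @ err
    E_out = E_out - lr * np.outer(err, h)      # adjust output embeddings
    for i in ctx:                              # adjust each context embedding
        E_in[i] -= lr * grad_h / len(ctx)
    return loss

for _ in range(50):
    loss = train_step(["the", "sat", "on", "the"], "cat")
print(round(float(loss), 4))  # loss shrinks as predictions sharpen
```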
Training Results: Watch Embeddings Learn!
Live Training View: Watch as random numbers transform into meaningful word embeddings! The matrix shows the actual numbers, while the visualization shows how words cluster by meaning.
Word Embedding Matrix
Each row is a word, each column is a "trait" dimension. Watch these numbers evolve during training!
2D Semantic Visualization
Words projected into 2D space. Similar words cluster together as the model learns!
Note: Your embeddings have more than two dimensions (five in this demo), but we're showing them in 2D using a PCA-like projection to visualize relationships.
What you're seeing:
Left (Matrix): The raw numbers that represent word meanings. Similar words develop similar number patterns.
Right (Visualization): The same data projected to 2D space. Watch words move from random positions into semantic clusters during training!
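The "PCA-like projection" mentioned above can be an actual PCA. Here's a sketch with scikit-learn (the embedding matrix is a random stand-in for the learned one):

```python
import numpy as np
from sklearn.decomposition import PCA

words = ["the", "cat", "dog", "sat", "king", "queen"]
embeddings = np.random.default_rng(0).normal(size=(len(words), 5))  # stand-in

coords = PCA(n_components=2).fit_transform(embeddings)  # project 5D -> 2D
for word, (x, y) in zip(words, coords):
    print(f"{word:>6}: ({x:+.2f}, {y:+.2f})")
```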
Interpretability
Now let's see what real word embeddings actually look like! In practice, embeddings have hundreds of dimensions (300-1024 is common). While most dimensions are hard to interpret, some clearly capture human-understandable concepts like gender, age, or social status.
Key Insight: Emergent Interpretability
Even though computers learn embeddings without being told what concepts to capture, some dimensions naturally emerge that correspond to meaningful semantic properties. This shows that mathematical word representations can capture real-world knowledge!
Interactive Real Embeddings Matrix
Click on dimension buttons to see how different concepts light up across words. Colors represent embedding values: red = high positive, blue = high negative, white = near zero.
Understanding the Matrix:
Rows = Different words | Columns = Individual embedding dimensions (300+ total) | Colors = Strength and direction of each feature
Notice how certain dimensions consistently activate for semantically similar words - this is how computers learn to group related concepts!
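Probing a dimension like this boils down to comparing one column of the matrix across words. A sketch with made-up values (real embeddings would be loaded from pre-trained files such as word2vec or GloVe):

```python
import numpy as np

# Made-up 6-dimensional embeddings for illustration
emb = {
    "king":  np.array([+0.9, 0.1, -0.7, 0.3, +0.2, 0.0]),
    "queen": np.array([-0.8, 0.2, -0.6, 0.3, +0.1, 0.1]),
    "man":   np.array([+0.7, 0.0, +0.5, 0.1, -0.2, 0.3]),
    "woman": np.array([-0.9, 0.1, +0.6, 0.2, -0.1, 0.2]),
}

# Dimension 0 flips sign between male and female words: a "gender" feature
for word, vec in emb.items():
    print(f"{word:>6}: dim 0 = {vec[0]:+.1f}")
```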
From Toy Examples to Real AI
Large language models like GPT build on the same principle: they learn thousands of features, some interpretable and many not, that capture aspects of human language and knowledge. Your 2D intuition scales to these high-dimensional spaces!
Summary
This demonstration illustrated how computers learn word relationships through predictive modeling:
Training process: Random embeddings gradually improve through prediction tasks
Semantic clustering: Words appearing in similar contexts develop similar numerical representations
Contextual learning: The training objective forces related words to cluster together
Vector semantics: Word meanings become mathematical objects that can be manipulated computationally
By training on the simple context prediction task, the model discovers meaningful semantic relationships. This approach forms the foundation of modern natural language processing and large language models.