How AI captures the meaning of language
We've learned how neural networks process data using matrices and matrix multiplication. But there's a fundamental mismatch: neural networks only understand numbers, while human language is made of words, sentences, and meaning.
On one side we have words, which carry meaning, context, and relationships. On the other side we have numbers, which can be multiplied and transformed. How do we translate between the two?
How do we convert the word "cat" into a list of numbers that captures its meaning in a way that neural networks can understand? This is what embeddings solve.
We're going deep into embeddings - understanding not just what they are, but how they work, why they work, and how modern AI systems use them. This chapter covers:
- How can geometry represent meaning?
- The mathematics of semantic closeness
- Breaking text into processable pieces
- Where the magic numbers live
- Teaching models about word order
- How embeddings learn and what they enable
Picture a map where every word has a specific location, and similar words sit close together while unrelated words sit far apart.
This "meaning map" is exactly what a vector space is. Each word becomes a point in a high-dimensional space where geometry - distances and angles - encodes semantic relationships.
Our map visualization is 2-dimensional (left-right, up-down). But to capture the full richness of language, we need many more dimensions.
- A 2-dimensional map can capture some relationships, but is very limited
- A few dozen dimensions are better, but still insufficient for language
- Hundreds of dimensions (768 is a common size) are what modern models actually use
More dimensions = more capacity to capture subtle semantic nuances
Imagine you have thousands of songs, and you want to organize them so that similar songs are easy to find. You need to convert each song into numbers that capture its essence. What kind of numbers should you use?
You decide to give each song a unique ID: Song #1, Song #2, Song #3, and so on. Let's see how this works for three songs: a rock ballad, a soft rock track, and an EDM track.
The Problem: When you search for songs similar to "Rock Ballad" (#4,532), the computer has no way to know that "Soft Rock" (#12,445) is related. The ID numbers are just arbitrary labels - they don't capture the musical similarity.
Instead of arbitrary IDs, what if each song got a list of numbers representing its characteristics? Slow songs get similar "tempo" numbers, guitar-heavy songs get similar "instrument" numbers, and so on.
Why This Works: The numbers directly represent musical properties. Both rock ballads have similar tempo values (0.234 and 0.241) — they're practically identical! The EDM track (-0.678) is clearly different. Now when you search for similar songs, the computer can use these numbers to find matches. This is exactly how embeddings work for words!
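Here's a minimal sketch of the idea in Python. The tempo values (0.234, 0.241, -0.678) come from the example above; the other feature dimensions and their exact numbers are made up for illustration:

```python
import numpy as np

# Feature vectors: [tempo, guitar_presence, energy].
# Tempo values match the example above; the other two dimensions are illustrative assumptions.
songs = {
    "Rock Ballad": np.array([0.234, 0.81, -0.12]),
    "Soft Rock":   np.array([0.241, 0.78, -0.05]),
    "EDM Track":   np.array([-0.678, -0.60, 0.92]),
}

query = songs["Rock Ballad"]
for name, vec in songs.items():
    distance = np.linalg.norm(query - vec)  # straight-line distance between feature vectors
    print(f"{name:12s} distance from Rock Ballad: {distance:.3f}")

# Soft Rock ends up much closer to Rock Ballad than the EDM track does.
# The arbitrary IDs (#4,532 vs. #12,445) could never tell us that.
```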
You might wonder: why do we need negative numbers? Why not just use 0 to 1? The answer: direction matters for capturing opposite meanings.
Real example: Across 768 dimensions, each dimension might capture a different semantic aspect: temperature, emotion, time, size, formality, concreteness, etc. Using negative and positive values lets the model represent rich, nuanced relationships.
- Values like 0.234, 0.267, 0.241 can be very close, capturing subtle similarities
- -0.8 vs +0.8 captures opposite meanings along each dimension
- Each dimension captures a different semantic aspect, combining to represent complex meaning
Now that we understand continuous vectors, let's look at real embeddings. An embedding is simply a list of continuous numbers - a vector - typically hundreds of values long (768 is a common size).
"cat" and "dog" have similar numbers because these words appear in similar contexts during training. "democracy" has very different numbers because it appears in completely different contexts. This isn't programmed - it emerges from learning!
We said "cat" and "dog" are "close" in vector space. But what does that mean mathematically? We need a precise way to measure similarity between vectors.
Imagine two people standing in a room. How do we measure if their positions are "similar"?
Two natural options come to mind: the straight-line (Euclidean) distance between their positions, or the angle between the directions you would point to reach each of them. Both work, but for embeddings we typically care about direction more than absolute distance.
The most common way to measure similarity is cosine similarity. It measures the angle between two vectors: cos(θ) = (A · B) / (|A| |B|).
- Small angle = similar words
- Large angle = different words
- Result range: -1 to 1
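A quick sketch of cosine similarity in Python. The three 4-dimensional vectors are toy values chosen for illustration, not outputs of a real model:

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| * |b|); ranges from -1 to 1."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 4-dimensional "embeddings" - real models use hundreds of dimensions.
cat       = np.array([0.60, 0.20, -0.10, 0.50])
dog       = np.array([0.55, 0.25, -0.05, 0.45])
democracy = np.array([-0.30, 0.80, 0.40, -0.60])

print(cosine_similarity(cat, dog))        # close to 1: similar direction, similar meaning
print(cosine_similarity(cat, democracy))  # much lower: very different meaning
```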
Imagine an AI system that processes thousands of messages daily. It needs to understand that "refund", "re-fund", and "refunding" are related. But how does AI break down text into processable pieces?
The simplest approach is to split on spaces so that each word becomes a token. But is splitting by spaces always the answer? Modern AI uses more sophisticated approaches.
Byte-Pair Encoding (BPE) is the most popular tokenization method in 2025. It learns an optimal vocabulary by iteratively merging the most frequent character pairs. Let's see the exact algorithm:
1. Start with all unique characters in your training data. For the corpus "refund refund refunding fund funding", that's {r, e, f, u, n, d, i, g}.
2. Count every adjacent pair of symbols and find the most frequent pair.
3. Merge that pair wherever it occurs and add the merged pair to the vocabulary.
4. Continue merging until the vocabulary reaches the desired size (typically 32,000-50,000 tokens).
At each iteration, BPE selects the pair (x, y) that maximizes frequency: (x, y) = argmax over all adjacent symbol pairs of count(x, y).
This greedy approach ensures the most common patterns are captured first, leading to efficient tokenization of frequent words and morphemes.
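Here is a minimal sketch of the BPE training loop in Python, run on the toy corpus above. The choice of 5 merges and the tie-breaking behavior are illustrative, not taken from any particular tokenizer:

```python
from collections import Counter

corpus = "refund refund refunding fund funding"

# Represent each word as a tuple of symbols (initially single characters), with its frequency.
words = Counter(tuple(word) for word in corpus.split())

def pair_counts(words):
    """Count how often each adjacent symbol pair occurs across the corpus."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

vocab = set(ch for word in words for ch in word)
for _ in range(5):  # 5 merges here; real tokenizers keep going until vocab size ~32k-50k
    counts = pair_counts(words)
    best = max(counts, key=counts.get)   # greedy: the most frequent pair wins
    vocab.add(best[0] + best[1])         # add the merged pair to the vocabulary
    words = merge(words, best)
    print("merged", best, "->", best[0] + best[1])

print(sorted(vocab))
```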
AI systems often need to understand messages about products like "SuperWidget-3000" or technical terms like "API integration". Subword tokenization excels here:
With a whole-word vocabulary, "SuperWidget-3000" becomes an unknown token - meaning lost!
With subword tokenization, it breaks into familiar pieces (something like "Super", "Widget", "3000") - meaning preserved through components!
We've tokenized our text into discrete units. Now we need to convert each token into a vector of numbers. This is where the embedding matrix comes in - it's essentially a giant lookup table that maps token IDs to vectors.
Once the AI has processed "refund" into token ID 5432, it needs the vector representation. Think of the embedding matrix like a phonebook: you look up a name (token ID) and get back contact information (the embedding vector).
Given a token ID, we simply retrieve the corresponding row from the embedding matrix:
Example: E[5432] returns row 5432 of the matrix, which is the embedding for "refund".
Before we dive deeper into embeddings, we need to understand one-hot encoding - the mathematical representation of categorical data that makes the lookup operation work.
One-hot encoding represents a token as a vector with all zeros except for a single 1 at the position corresponding to that token's ID.
Here's the key insight: Looking up an embedding is mathematically equivalent to multiplying the one-hot vector by the embedding matrix!
One-hot vector: shape (1 × vocab_size) = (1 × 5)
Embedding matrix E: shape (vocab_size × embedding_dim) = (5 × 3)
Their product: shape (1 × embedding_dim) = (1 × 3)
The one-hot vector with a 1 at position 2 selects row 2 of E!
In practice: We skip creating the one-hot vector and directly index into the embedding matrix (it's more efficient). But mathematically, they're equivalent!
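A small sketch of that equivalence in Python. The tiny 5-token vocabulary and the 3-dimensional embedding values are made-up numbers for illustration:

```python
import numpy as np

# Toy embedding matrix: 5 tokens in the vocabulary, 3 dimensions each.
E = np.array([
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
    [1.3, 1.4, 1.5],
])

token_id = 2

# Method 1: direct lookup - just index into the matrix.
lookup = E[token_id]

# Method 2: multiply a one-hot vector by the embedding matrix.
one_hot = np.zeros(E.shape[0])
one_hot[token_id] = 1.0
product = one_hot @ E

print(lookup)                        # [0.7 0.8 0.9]
print(product)                       # [0.7 0.8 0.9]
print(np.allclose(lookup, product))  # True - the two methods are mathematically equivalent
```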
| Feature | One-Hot Encoding | Learned Embeddings |
|---|---|---|
| Dimensionality | vocab_size (32,000+) | Compact (e.g., 768) |
| Semantic Similarity | None - all words equally distant | Captures meaning - similar words are close |
| Sparsity | 99.997% zeros (sparse) | Dense - every dimension used |
| Parameters | 0 (fixed representation) | Learned from data (trainable) |
| Generalization | Poor - can't handle unseen words | Good - subword tokens help |
AI systems use embeddings to recognize when differently worded messages express similar sentiment. But similarity of meaning is only part of the story.
Consider these two messages: "The customer canceled the refund" and "The refund canceled the customer". They contain exactly the same tokens - only the order differs.
When we convert tokens to embeddings, we lose all information about where the token appears in the sequence. The embedding for "customer" is identical whether it's the first word or the last word.
Embeddings have no position information built in!
Positional encodings solve this by adding position information to our embeddings. The idea is brilliantly simple: create special vectors that encode position, and add them to the word embeddings.
By adding positional information, each token now has both meaning and position!
The original Transformer paper (2017) introduced sinusoidal positional encodings - a clever mathematical approach that doesn't require training. Here are the formulas:

PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

where pos is the token's position, i indexes each sine/cosine pair, and d is the embedding dimension.
Let's compute the positional encoding for position 0 with d = 768. Every sine term is sin(0) = 0 and every cosine term is cos(0) = 1, so:
Result for position 0: PE(0) = [0.0, 1.0, 0.0, 1.0, …]
Now let's compute for position 1. The first pair uses sin(1) ≈ 0.841 and cos(1) ≈ 0.540; later pairs use progressively lower frequencies:
Result for position 1: PE(1) = [0.841, 0.540, 0.828, 0.561, …]
Notice: Different from position 0, giving each position a unique signature!
- Different positions produce different patterns of sine and cosine values
- Adjacent positions have similar encodings (position 5 and position 6 are close)
- Sine and cosine always output values between -1 and 1, a bounded range that mixes well with embedding values
- They can handle sequences longer than what was seen during training
- PE(pos+k) can be expressed as a linear function of PE(pos), helping the model learn relative positions
Here's what the first 4 dimensions look like across the first 10 positions (the sketch below computes them):
Each position gets a unique pattern - like a barcode for position!
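A short sketch that computes these values with NumPy, using d = 768 as in the worked example above and printing only the first 4 dimensions for the first 10 positions:

```python
import numpy as np

def sinusoidal_positional_encoding(num_positions, d):
    """PE[pos, 2i] = sin(pos / 10000**(2i/d)), PE[pos, 2i+1] = cos(pos / 10000**(2i/d))."""
    pe = np.zeros((num_positions, d))
    positions = np.arange(num_positions)[:, np.newaxis]   # column of positions 0..num_positions-1
    div = 10000 ** (np.arange(0, d, 2) / d)               # one frequency per sin/cos pair
    pe[:, 0::2] = np.sin(positions / div)
    pe[:, 1::2] = np.cos(positions / div)
    return pe

pe = sinusoidal_positional_encoding(num_positions=10, d=768)
print(pe[0, :4])   # [0.    1.    0.    1.   ]  - matches PE(0) above
print(pe[1, :4])   # [0.841 0.540 0.828 0.561] - matches PE(1) above
print(pe[:, :4])   # each of the 10 positions gets its own unique pattern
```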
While sinusoidal encodings are still widely used, modern research has introduced alternatives that can perform better for specific tasks:
- Learned positional embeddings. Approach: treat position encodings as trainable parameters, just like word embeddings.
- Rotary Position Embedding (RoPE). Approach: encode position by rotating the embedding vectors in a high-dimensional space. 2025 status: the dominant choice for large language models.
- ALiBi (Attention with Linear Biases). Approach: instead of adding anything to the embeddings, directly bias the attention scores based on distance.
- Relative positional encodings. Approach: encode the relative distance between tokens rather than absolute positions.
In real-world applications, positional encodings enable understanding of:
- Temporal sequences: the AI understands that a chain of events is described in time order.
- Long-range dependencies: "that I ordered last week" modifies "product" - position helps track these relationships.
- Negation: the word "Not" at the beginning flips the sentiment of the entire phrase.
- Questions vs. statements: word order distinguishes questions from statements - critical for routing!
Bottom Line: Positional encodings are the secret ingredient that allows transformers to understand that language is not just a bag of words - it's a sequence where order carries meaning.
We've seen what embeddings are and how they're used. But where do these magical numbers come from? How does the model learn that "cat" and "dog" should have similar embeddings?
Words that appear in similar contexts tend to have similar meanings. This is called the distributional hypothesis, and it's the foundation of embedding learning.
"cat" and "dog" appear with similar surrounding words: "The ___ is sleeping", "The ___ chased". The model learns to give them similar embeddings because they appear in similar contexts.
Embeddings are learned as part of larger model training. Here are the main approaches used in 2025:
Next-token prediction (causal language modeling): train the model to predict the next word in a sequence. This is the dominant approach for modern LLMs.
The embedding layer is the first layer of the model. As the model learns to predict words, the embeddings automatically learn to capture meaning!
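Here is a minimal PyTorch sketch of that idea: an embedding layer as the first layer of a next-token predictor. The vocabulary size, dimensions, and random "training data" are placeholder assumptions - a real LLM is vastly larger - but the embedding table is trained the same way, by backpropagating the prediction loss:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 1000, 64   # toy sizes; real models use 32k+ tokens and 768+ dimensions

model = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),   # the embedding matrix - learned like any other weight
    nn.Linear(embed_dim, vocab_size),      # predict a score for every possible next token
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Fake "text": random token IDs standing in for a tokenized corpus.
tokens = torch.randint(0, vocab_size, (512,))
inputs, targets = tokens[:-1], tokens[1:]   # predict token t+1 from token t

for step in range(100):
    logits = model(inputs)                  # shape (511, vocab_size)
    loss = loss_fn(logits, targets)
    optimizer.zero_grad()
    loss.backward()                         # gradients flow into the embedding rows that were used
    optimizer.step()

# After training on real text, rows of model[0].weight for words that appear
# in similar contexts end up close together - that's the learned embedding space.
```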
Masked language modeling: randomly mask out words in a sentence and train the model to predict them from context.
Unlike GPT which only sees past context, BERT sees both past and future context (bidirectional).
Contrastive learning: train embeddings to be similar for related items and different for unrelated items.
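A tiny sketch of the contrastive idea in PyTorch. The "related" and "unrelated" sentences, the random vectors standing in for model outputs, and the margin of 0.2 are all illustrative assumptions:

```python
import torch
import torch.nn.functional as F

# Pretend these are embeddings a model produced for three sentences.
anchor   = torch.randn(768, requires_grad=True)   # "I want a refund"
positive = torch.randn(768, requires_grad=True)   # "Please return my money"       (related)
negative = torch.randn(768, requires_grad=True)   # "What are your opening hours?" (unrelated)

sim_pos = F.cosine_similarity(anchor, positive, dim=0)
sim_neg = F.cosine_similarity(anchor, negative, dim=0)

# Push related pairs together and unrelated pairs apart (margin of 0.2 is arbitrary).
loss = torch.clamp(0.2 - sim_pos + sim_neg, min=0)
loss.backward()   # gradients nudge the embeddings in the right directions
```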
The embedding landscape has evolved dramatically. The leading models as of 2025 target a range of niches:
- General text similarity and semantic search
- Multilingual applications and classification
- Domain-specific retrieval and long documents
- Self-hosted, privacy-sensitive applications
- High-quality open-source options
MTEB (Massive Text Embedding Benchmark): The standard benchmark for evaluating embedding models across 56 datasets covering 8 tasks (classification, clustering, retrieval, etc.). Scores are out of 100. As of October 2025, top models score in the mid-60s.
Embeddings power countless AI applications. Here are the major use cases in 2025:
Convert documents and queries to embeddings, then find relevant documents using cosine similarity. This enables AI systems to retrieve information based on meaning rather than exact keyword matches (a code sketch follows after this list of use cases).
Embed items and users, recommend items whose embeddings are close to the user's preferences.
Use embeddings as input features to classifiers for tasks like spam detection, sentiment analysis, topic classification.
Group similar documents by clustering their embeddings to discover themes and patterns.
Find near-duplicate content by computing embedding similarity.
Multilingual embeddings map words from different languages to the same vector space.
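To make the semantic search use case concrete, here's a minimal sketch using the sentence-transformers library. The model name "all-MiniLM-L6-v2" and the example documents are illustrative choices; any embedding model with a similar API would work:

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative model and documents - swap in whatever fits your stack.
model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "How do I request a refund for my order?",
    "Our office hours are 9am to 5pm on weekdays.",
    "Shipping usually takes 3-5 business days.",
]
doc_embeddings = model.encode(documents, convert_to_tensor=True)

query = "I want my money back"
query_embedding = model.encode(query, convert_to_tensor=True)

# Cosine similarity between the query and every document; pick the best match.
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
best = scores.argmax().item()
print(documents[best])   # the refund document, despite sharing almost no keywords with the query
```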
Let's bring it all together. Here's how embeddings work in a modern support system:
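The worked example below is a compact sketch of the whole pipeline from this chapter. The tiny vocabulary, the whitespace tokenizer, the random embedding matrix, and the 8-dimensional size are toy assumptions; a real support system would use a trained tokenizer and model, but the steps are the same:

```python
import numpy as np

# Step 1: Text -> tokens (toy whitespace tokenizer with a tiny fixed vocabulary).
vocab = {"i": 0, "want": 1, "a": 2, "refund": 3, "please": 4}
text = "i want a refund please"
token_ids = [vocab[word] for word in text.split()]

# Step 2: Tokens -> embeddings (look up rows in a random embedding matrix).
embed_dim = 8
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), embed_dim))
embeddings = E[token_ids]                      # shape (5, 8)

# Step 3: Add sinusoidal positional encodings so word order is preserved.
positions = np.arange(len(token_ids))[:, None]
div = 10000 ** (np.arange(0, embed_dim, 2) / embed_dim)
pe = np.zeros_like(embeddings)
pe[:, 0::2] = np.sin(positions / div)
pe[:, 1::2] = np.cos(positions / div)
model_input = embeddings + pe                  # what the neural network actually sees

print(model_input.shape)   # (5, 8): five tokens, each an 8-dimensional vector with position baked in
```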
You've now learned the complete story of how words become numbers that AI can understand:
The Magic: Embeddings transform the discrete, symbolic world of language into continuous geometric space where meaning becomes distance, similarity becomes proximity, and relationships become directions. This is how modern AI bridges the gap between human communication and machine computation.
Core concepts from this chapter:
Neural networks only understand numbers. Embeddings convert words, tokens, and concepts into dense vectors (lists of numbers) that capture semantic meaning in a way machines can process.
Before embedding, text must be split into tokens. Byte-Pair Encoding (BPE) finds common patterns and creates a vocabulary that balances flexibility (handling new words) with efficiency (common words get single tokens).
Embeddings are learned from massive text data. Words that appear in similar contexts (like "cat" and "dog") end up with vectors that are close together in the embedding space, capturing semantic relationships.
Embeddings alone don't capture word order. "The customer canceled the refund" and "The refund canceled the customer" would look identical. Positional encodings add position information so models understand sequence order.
In embedding space, similarity becomes distance. Close vectors = similar meaning. This geometric structure enables powerful operations: vector arithmetic, similarity search, and clustering related concepts.
Every modern language model uses embeddings: search engines, chatbots, translation systems, recommendation engines. They're the input layer that makes language processing with neural networks possible.
Key Transformation:
Text → Tokens → Embeddings → Embeddings + Positional Encodings → Neural Network Input
This pipeline transforms discrete symbols (words) into continuous geometric space where meaning becomes computable.