How AI captures the meaning of language
We've learned how neural networks process data using matrices and matrix multiplication. But here's the fundamental challenge: neural networks only understand numbers, while human language is made of words, sentences, and meaning.
Words carry meaning, context, and relationships → How do we translate? → Numbers that can be multiplied and transformed
How do we convert the word "cat" into a list of numbers that captures its meaning in a way that neural networks can understand? This is what embeddings solve.
We're going deep into embeddings - understanding not just what they are, but how they work, why they work, and how modern AI systems use them. This chapter covers:
How can geometry represent meaning?
The mathematics of semantic closeness
Breaking text into processable pieces
Where the magic numbers live
Teaching models about word order
How embeddings learn and what they enable
Remember vector spaces from Chapter 5? We represented customers as vectors in 2D space:
Each dimension represented a measurable feature (subscription length, usage). We used vector math to find patterns and make predictions.
Embeddings use the exact same mathematical principles, but instead of representing customer features, they represent word meanings. Instead of 2 dimensions (months, hours), we have 768+ dimensions that capture semantic relationships.
Let's see how this works...
Picture a map where every word has a specific location, and words with similar meanings sit close together.
This "meaning map" is exactly what a vector space is. But what actually makes it a "vector space"? Let's understand the mathematical properties that make this possible.
You already learned the two fundamental rules of vector spaces in Chapter 5:
The mathematical rules are identical. What changes is what the vectors represent:
Because embeddings live in a vector space with these same mathematical properties, AI models can:
Our map visualization is 2-dimensional (left-right, up-down). But to capture the full richness of language, we need many more dimensions.
2-3 dimensions: can capture some relationships, but very limited
Tens of dimensions: better, but still insufficient for language
Hundreds to thousands of dimensions (768+): what modern AI models actually use
Quick context: Embeddings are vectors (lists of numbers) that represent word meanings in AI models. The "dimension" is how many numbers are in each vector — more numbers means more capacity to capture subtle meanings.
Different AI models use different embedding dimensions. Larger dimensions can capture more nuanced meanings, but require more computational resources. Here's what major models use:
More dimensions = more capacity to capture subtle semantic nuances, but also more computational cost
Throughout this chapter, we'll use 1,024 dimensions as our reference size, with much smaller vectors in the worked examples because those are easier to visualize and understand. Just remember that production embedding models typically use anywhere from 1,024 to 3,072 dimensions depending on their use case!
Imagine organizing thousands of songs so similar ones are easy to find. You need to convert each song into numbers that capture its characteristics. Here's the challenge:
Think about these three songs:
❓ Question: Can you represent each song with ONE number that captures similarity?
Let's try placing each song on a number line (0.0 to 1.0):
⚠️ The Fundamental Limitation
Slow Rock is similar to Slow Classical in tempo (both slow),
BUT similar to Fast Rock in genre (both rock).
One number forces you to choose: sort by tempo OR genre, not both!
Songs vary across MANY independent dimensions: tempo, genre, mood, instruments, vocals, era, energy...
One number can't capture multiple independent characteristics simultaneously!
Instead of one number, represent each song as a vector — a list of numbers where each dimension captures a different characteristic:
Now the computer can calculate similarity! These two songs have similar values for dimensions 1 & 2 (both rock), but different values for dimensions 3 & 4 (tempo/vocals). In a real system, the computer computes the mathematical distance between the full vectors, across hundreds of dimensions rather than just the four shown here.
The numbers directly represent musical properties. Both rock ballads have similar tempo values (0.234 and 0.241) — they're practically identical! The EDM track (-0.678) is clearly different. Now when you search for similar songs, the computer can use these numbers to find matches. This is exactly how embeddings work for words!
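Here's a minimal sketch of that idea in code; the 4-dimensional song vectors are illustrative values, not output from a real model:

```python
import numpy as np

# Toy 4-dimensional song vectors:
# dims 1-2 ≈ "rock-ness", dim 3 ≈ tempo, dim 4 ≈ vocal style
rock_ballad_a = np.array([0.81, 0.76, 0.234, 0.55])
rock_ballad_b = np.array([0.79, 0.74, 0.241, 0.52])
edm_track     = np.array([-0.30, -0.42, -0.678, 0.10])

def distance(a, b):
    """Euclidean distance: smaller means the songs are more alike."""
    return np.linalg.norm(a - b)

print(distance(rock_ballad_a, rock_ballad_b))  # small -> similar songs
print(distance(rock_ballad_a, edm_track))      # large -> different songs
```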
You've seen that embeddings are vectors: [0.234, -0.512, 0.891]. But why use decimal numbers? Why not simpler integers like [1, 2, 3]?
After all, [1, 2, 3] is cleaner, easier to store, and faster to process. What's wrong with discrete values?
Imagine you're training a model. It predicts "cat" when the correct answer is "dog". The model needs to adjust the embedding for "cat" to make it closer to "dog".
With discrete numbers [1, 2, 3]:
The learning algorithm calculates a gradient — a direction that says "move this number up by 0.03" or "move that number down by 0.17". But with discrete numbers, you can't move by 0.03. You can only jump by 1.
With continuous numbers [0.234, -0.512, 0.891], the model can move in tiny steps:
Each adjustment is small and controlled. The model follows the gradient like a ball rolling downhill, gradually finding better representations. This is how neural networks learn — through millions of tiny gradient-based updates.
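Here's a minimal sketch of one such update; the gradient values and learning rate are illustrative:

```python
import numpy as np

# A toy embedding and a gradient computed by the learning algorithm
cat = np.array([0.234, -0.512, 0.891])
gradient = np.array([0.03, -0.17, 0.05])   # "move up by 0.03", "down by 0.17", ...

learning_rate = 0.1
cat = cat - learning_rate * gradient       # one tiny, controlled adjustment
print(cat)                                 # ≈ [0.231, -0.495, 0.886]
```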
Neural network training uses calculus. Specifically, it computes derivatives (rates of change). Derivatives only exist for continuous functions.
This is why all learnable parameters in neural networks (weights, biases, embeddings) use floating-point numbers. Learning is optimization through gradients, and gradients require continuity.
Continuous numbers aren't a choice — they're a requirement for learning.
Without smooth values, there are no gradients. Without gradients, there is no training.
You don't manually assign these numbers. Instead, AI trains on millions of examples (e.g., "users who liked Song A also liked Song B") and automatically learns what numbers to assign to make similar items end up close together in vector space.
About the Music Example:
This example uses interpretable dimensions ("tempo", "guitar-ness") for teaching purposes. Real embeddings have latent features learned by AI — you can't point to dimension #347 and say "this represents pluralization." The patterns are complex and distributed across many dimensions. The music analogy helps understand the concept, but actual embeddings are learned representations, not hand-crafted features.
Because these are numbers, not arbitrary labels, the computer can use mathematical formulas to calculate how similar two items are. Looking at [0.234, -0.512, 0.891] and [0.241, -0.505, 0.887], the computer can measure: "These numbers are almost the same, so these songs must be similar!"
We'll explore the exact mathematical formulas for measuring similarity in the next section!
You might wonder: why do we need negative numbers? Why not just use 0 to 1? The answer: direction matters for capturing opposite meanings.
Real example: Across 768 dimensions, each dimension might capture a different semantic aspect: temperature, emotion, time, size, formality, concreteness, etc. Using negative and positive values lets the model represent rich, nuanced relationships.
The -1 to +1 scale shown here is for illustration. In practice, raw embedding values are unbounded — they can be any positive or negative number. However, many models apply normalization (like L2 normalization) which scales the entire vector so its length equals 1, bringing individual values roughly into the -1 to +1 range. This normalization helps with:
So while we use -1 to +1 for teaching, real embeddings before normalization can have values like -47.3 or +12.8!
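Here's a minimal sketch of L2 normalization, using those example raw values:

```python
import numpy as np

raw = np.array([-47.3, 12.8, 3.1])          # raw, unbounded embedding values
normalized = raw / np.linalg.norm(raw)      # L2 normalization: vector length becomes 1

print(np.linalg.norm(normalized))           # 1.0
print(normalized)                           # values now fall roughly within -1 to +1
```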
Now that we understand continuous vectors, let's look at real embeddings. An embedding is simply a list of continuous numbers - a vector:
Explore how words cluster in 2D vector space! Click words to see their relationships.
Similar words (cat, dog, kitten) cluster together!
Different concepts (animals vs politics) stay far apart.
This is how AI understands meaning through geometry.
We said "cat" and "dog" are "close" in vector space. But what does that mean mathematically? We need a precise way to measure similarity between vectors.
Imagine two people standing in a room. How do we measure if their positions are "similar"?
Both work, but for embeddings we typically care about direction more than absolute distance.
The most common way to measure similarity is cosine similarity. It measures the angle between two vectors.
Small angle = Similar words
Large angle = Different words
Cosine similarity gives you a score from -1 to 1:
• 1 = Vectors point same direction (very similar meanings)
• 0 = Vectors perpendicular (unrelated)
• -1 = Vectors point opposite directions (opposite meanings)
The math handles the complexity — you just need to know higher scores = more similar!
Result range: -1 to 1
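Here's a minimal NumPy sketch of the calculation (the three-dimensional vectors are illustrative, not from a real model):

```python
import numpy as np

def cosine_similarity(a, b):
    # dot product of the vectors divided by the product of their lengths
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cat = np.array([0.82, -0.31, 0.65])
dog = np.array([0.79, -0.28, 0.61])
car = np.array([-0.45, 0.88, 0.02])

print(cosine_similarity(cat, dog))    # ≈ 1.0 -> very similar meanings
print(cosine_similarity(cat, car))    # negative -> pointing in different directions
```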
Calculate similarity step-by-step! Adjust vectors and watch the formula come to life.
Cosine similarity compares vector directions to measure semantic similarity.
When words have similar meanings, their embeddings point in similar directions, producing cosine scores close to 1.
Cosine similarity is widely used in semantic search, retrieval systems, and recommendation engines across the AI industry.
As we explored in Chapter 1, AI finds patterns in data through numerical processing. To leverage this capability for language, we transform text into vectors through a series of steps:
Let's understand each step of this pipeline, starting with: How do you break text into pieces?
Let's use this sentence throughout to see how different approaches work:
The Idea: Just like a keyboard has ~100 keys (letters, numbers, punctuation) but you can type anything, what if our dictionary just stored individual characters? Then we could "spell out" any word!
The Idea: Instead of individual letters, what if we stored every complete word? "play", "playing", "player", "players", "replay", "replaying" — each gets its own spot in the dictionary.
Analogy: Like a keyboard with 500,000 keys — one for every word. Fast, but impractical and can't type new words!
Attempt 1 was too small. Attempt 2 was too big. Let's look closely at some words and see if we notice anything interesting...
Do you see it? The piece play keeps showing up!
Same with ing, er, re...
Instead of storing 500K whole words, what if we just stored these common pieces?
BPE automatically discovers these frequent patterns by analyzing millions of sentences.
It's the same pattern recognition from Chapter 1 — but applied to finding reusable text pieces.
BPE automatically discovers these common pieces by analyzing text and repeatedly merging the most frequent adjacent character pairs.
pl → pla → play. Result: A vocabulary of frequently-used pieces that can combine to form any word, including ones never seen during training.
Who decides "desired size"? The model designers choose vocabulary size during training based on tradeoffs: smaller vocabularies (32K-50K tokens) are faster but produce longer sequences, while larger ones (100K-256K tokens) are slower but handle multilingual text more efficiently. Most modern LLMs use 32K-50K for English-focused tasks or 100K-256K for multilingual support.
Start with a training corpus C and an initial vocabulary V₀ containing all unique characters:
Each word is split into individual characters: "low" → ['l', 'o', 'w']
For each iteration i, count all adjacent symbol pairs in the corpus:
Select the most frequent pair and merge it into a new symbol:
Repeat steps 2-3 until vocabulary reaches desired size |V| = k:
To tokenize a new word, apply learned merges in the same order:
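Here's a minimal sketch of the training loop in steps 1-3, using a tiny toy corpus from the BPE literature; real tokenizers add byte-level handling, special tokens, and heavy optimization:

```python
from collections import Counter

# word (pre-split into characters) -> frequency in the training corpus
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
          ("n", "e", "w", "e", "s", "t"): 6, ("w", "i", "d", "e", "s", "t"): 3}

def count_pairs(corpus):
    """Step 2: count all adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge(corpus, pair):
    """Step 3: merge every occurrence of the chosen pair into one new symbol."""
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1]); i += 2
            else:
                out.append(word[i]); i += 1
        merged[tuple(out)] = freq
    return merged

num_merges = 5                      # stands in for the "desired vocabulary size"
for _ in range(num_merges):
    best = count_pairs(corpus).most_common(1)[0][0]   # most frequent adjacent pair
    print("merge:", best)
    corpus = merge(corpus, best)
```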
Modern LLMs like GPT-4, Claude, and others use vocabularies in this range — small enough to be efficient, large enough to be expressive.
Want to explore tokenizers yourself?
Now that we have tokens ["play", "ing"], we need to convert them to numbers.
When the tokenizer is trained using BPE (as described above), each discovered token is assigned a sequential integer ID as it's added to the vocabulary:
The ID number simply reflects when that token was added to the vocabulary during training. "play" might get ID 258 if it was the 259th token discovered (starting from 0).
Once trained, this vocabulary is fixed. When you type "play", the tokenizer looks it up and finds ID 258.
Each token gets a unique integer ID from the vocabulary:
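As a minimal sketch, the lookup is nothing more than a dictionary; the IDs below reuse the example values from this chapter:

```python
# A tiny slice of a trained BPE vocabulary
vocab = {"play": 42, "ing": 87, "cat": 512, "s": 23}

tokens = ["play", "ing", "cat", "s"]      # tokenizer output for "playing cats"
token_ids = [vocab[t] for t in tokens]
print(token_ids)                          # [42, 87, 512, 23]
```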
As we saw with the music library analogy, a single number cannot capture semantic similarity. Instead, each token receives a vector of numbers (typically 256-4096 dimensions), where similar words naturally receive similar vectors.
Embeddings learn through the same pattern recognition process covered in Chapter 1 — the model adjusts embedding weights to minimize errors using gradient descent and loss functions (introduced in Chapter 2).
Want to fine-tune pre-trained models for your specific needs? See Chapter 5: Fine-tuning & Model Adaptation.
A model with 50,000 tokens, each represented by a 1,024-dimensional vector, needs to store and retrieve 51.2 million numbers. The model must access these embeddings thousands of times per second during inference.
The question: How do we organize these embeddings so the model can instantly retrieve the vector for any token ID?
All token embeddings are stored in a single two-dimensional matrix called the embedding matrix. The simplest way to understand it: imagine a giant Excel spreadsheet.
Just like finding a value in Excel: if you want the embedding for "cat" (token ID 512), jump directly to row 512 and grab all 1,024 numbers across that row.
The spreadsheet analogy reveals why LLMs are so fast. When you need the embedding for "cat" (token ID 512):
This is exactly like Excel's VLOOKUP — you don't scan every row, you jump directly to the row number. Whether the vocabulary has 50,000 tokens or 1 million tokens, accessing any row takes the same tiny, constant amount of time.
The Key Insight:
When ChatGPT processes your 50-word message, it performs 50 row lookups in this embedding matrix. With direct row access, this happens in milliseconds. If it had to search through rows sequentially, each lookup would slow down proportionally to vocabulary size — turning instant responses into multi-second delays.
Theoretically, embedding lookup is a linear transformation. For vocabulary size V and embedding dimension d, we have an embedding matrix E ∈ ℝ^(V×d).
To retrieve the embedding for the token with index i, we could use a one-hot vector x ∈ ℝ^V: all entries are 0 except x_i = 1, and the lookup becomes the product x·E.
When you multiply a one-hot vector by the embedding matrix, the result is mathematically equivalent to extracting the i-th row, because every term in the sum vanishes except the one where x_i = 1.
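A tiny NumPy demonstration of that equivalence (the 6-token vocabulary and 4 dimensions are deliberately small so you can print everything):

```python
import numpy as np

V, d = 6, 4                          # toy vocabulary and embedding size
E = np.arange(V * d).reshape(V, d)   # embedding matrix E with shape (V, d)

i = 2                                # token index we want
one_hot = np.zeros(V)
one_hot[i] = 1

print(one_hot @ E)                   # matrix multiplication: V * d multiply-adds
print(E[i])                          # direct row access: same result, no arithmetic
```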
Forward pass computational cost: multiplying a 50,000-entry one-hot vector by the 50,000 × 1,024 embedding matrix costs about 51.2 million multiply-adds per token, even though all but one row's worth are multiplications by zero.
Memory overhead: every token in every sequence needs its own 50,000-entry vector that is zero everywhere except a single position.
In production systems processing thousands of tokens per second, this memory overhead becomes prohibitive. NLP vocabularies can reach 100K-1M tokens, making one-hot encoding impractical.
Modern frameworks recognize this inefficiency and implement embedding lookup as direct array access:
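For instance, here's a minimal PyTorch sketch of direct lookup with nn.Embedding; the token IDs reuse the example values from earlier in this chapter:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 50_000, 1_024
embedding = nn.Embedding(vocab_size, embed_dim)   # the embedding matrix: 51.2M learnable numbers

token_ids = torch.tensor([42, 87, 512, 23])       # IDs for ["play", "ing", "cat", "s"]
vectors = embedding(token_ids)                    # direct row lookup, no matrix multiplication
print(vectors.shape)                              # torch.Size([4, 1024])
```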
This operation has O(1) time complexity — constant time regardless of vocabulary size — and zero computational waste.
During training, gradients flow back through the embedding layer. The key insight: only the accessed row receives gradients.
This sparse gradient structure is why embedding layers are efficient to train. In a batch of 512 tokens from a 50K vocabulary, only ~512 rows receive gradient updates (assuming unique tokens), while 49,488 rows require no computation.
Advanced frameworks like PyTorch implement sparse gradient updates to further optimize this — gradients are stored as sparse tensors, updating only the non-zero entries rather than the full matrix.
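A minimal sketch of what that looks like in PyTorch (the sizes match the numbers used above; the token IDs are illustrative):

```python
import torch
import torch.nn as nn

# sparse=True makes backward() return gradients only for the rows that were used
embedding = nn.Embedding(50_000, 1_024, sparse=True)

ids = torch.tensor([42, 87, 512])
loss = embedding(ids).sum()
loss.backward()

print(embedding.weight.grad.is_sparse)   # True: only the 3 accessed rows carry gradients
```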
The mathematical equivalence means embeddings are theoretically a linear transformation, but the implementation efficiency makes modern NLP possible. Without direct indexing:
AI doesn't memorize every word.
Instead, it discovers frequent patterns (like "play", "ing", "er") and combines them like building blocks.
With ~100K token pieces, it can understand billions of word combinations.
"playing cats" → Raw input text
BPE segmentation: ["play", "ing", "cat", "s"] → Reusable pieces
Vocabulary lookup: [42, 87, 512, 23] → Unique identifiers
Matrix lookup: Dense vectors → Numbers that capture meaning
Semantic representation: Similar meanings cluster together
This is how AI bridges the gap between human language and mathematical computation.
Tokenization impacts model cost, speed, and fairness across languages. The embedding matrix enables fast, constant-time lookups. Together, they convert billions of word combinations into a space where AI can find patterns and understand meaning.
We've built an amazing system: text becomes tokens, tokens become embeddings, embeddings capture meaning. But there's a critical problem we haven't addressed yet.
Consider these two sentences:
Wait... Stop and Think
Both sentences use the exact same tokens: ["the", "dog", "chased", "the", "cat"]. Just in different order.
If we look up embeddings for these tokens, we get:
• "the" → [0.12, -0.45, ...]
• "dog" → [0.79, -0.28, ...]
• "chased" → [0.34, 0.67, ...]
• "cat" → [0.82, -0.31, ...]
But these embeddings are the same regardless of where the word appears!
The embedding for "dog" is identical whether it's at position 1 or position 4.
So how does the model know "dog" comes before "chased" vs after?
When we convert tokens to embeddings, we lose information about word order. The model receives a collection of meaning vectors but has no indication which came first, second, or third.
This poses a significant challenge for language understanding. Word order determines who did what to whom, whether something happened in past or future, and countless other critical distinctions.
We need to add position information to each token's embedding. The approach: add a position-specific vector to each embedding that encodes "this token is at position 0", "this is at position 1", etc.
Now the same word gets different representations based on where it appears:
Here's the brilliant idea: Create special "position vectors" and add them to the word embeddings!
Let's say we have the sentence: "The dog sleeps"
Notice: The embedding for "dog" is the same whether it's at position 1, position 5, or position 100!
For each position (0, 1, 2, 3, ...), we create a unique "position vector" - a special pattern of numbers:
Each position gets its own unique pattern - like a barcode for that position!
Now we simply add the word embedding and position vector together:
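Here's a minimal sketch of that addition, using the original Transformer's sinusoidal recipe to build the position vectors (one common choice; the random "dog" embedding is just a stand-in for illustration):

```python
import numpy as np

d = 8  # tiny embedding size for illustration

def position_vector(pos, d=d):
    """Sinusoidal position vector: sin/cos waves at different frequencies."""
    i = np.arange(d // 2)
    angles = pos / (10000 ** (2 * i / d))
    vec = np.zeros(d)
    vec[0::2] = np.sin(angles)
    vec[1::2] = np.cos(angles)
    return vec

dog = np.random.default_rng(0).normal(size=d)   # stand-in word embedding for "dog"

dog_at_1 = dog + position_vector(1)             # "dog" as the 2nd token
dog_at_4 = dog + position_vector(4)             # "dog" as the 5th token
print(np.allclose(dog_at_1, dog_at_4))          # False: same word, different final vectors
```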
✓ Success! This final vector now contains BOTH the meaning of "dog" AND the fact that it's at position 1!
The same word at different positions gets different final vectors!
This is exactly what we need to solve the word order problem.
✓ Notice: "dog" at position 1 has different numbers than "dog" at position 4!
The AI can now tell the difference between "dog bites man" and "man bites dog"
We need a way to encode position information. There are different approaches used by different models:
Note: The "add position vector" approach we explained is conceptually simpler and used by the original Transformer. Modern models like LLaMA use rotation-based methods (RoPE) for better performance, but the core idea remains: inject position information so the model knows word order!
Different approaches to encoding position information have emerged over time, each with tradeoffs:
Current State (2025): RoPE is the default positional strategy in most modern LLMs (LLaMA 2/3, Gemma, Mistral, Code-Llama), while ALiBi is used in MPT, Falcon, and JINA models. Both methods significantly outperform the original sinusoidal approach, especially for long context windows.
What is RoPE?
RoPE (Rotary Position Embedding) is a method that encodes position information by rotating query and key vectors in the attention mechanism rather than adding position vectors to embeddings. It has become the most popular positional strategy for modern transformers.
Instead of adding a position vector to the word embedding, RoPE rotates the embedding by an angle that depends on its position. Think of it like a clock hand rotating as time (position) advances.
RoPE organizes the embedding dimensions as pairs (treating each pair as a 2D coordinate). For a 768-dim embedding, that's 384 pairs. Each pair gets rotated by a position-dependent angle.
For position pos and dimension pair i:
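Concretely, the angle for dimension pair i is θ_i = pos / 10000^(2i/d), and the pair (x_2i, x_2i+1) is rotated by that angle. Here's a minimal NumPy sketch of that rotation; the toy query vector is an illustration, not output from a real model:

```python
import numpy as np

def rope(x, pos):
    """Rotate each (even, odd) dimension pair of x by a position-dependent angle."""
    d = x.shape[-1]
    i = np.arange(d // 2)
    theta = pos / (10000 ** (2 * i / d))        # one angle per dimension pair
    x_even, x_odd = x[0::2], x[1::2]
    rotated = np.empty_like(x)
    rotated[0::2] = x_even * np.cos(theta) - x_odd * np.sin(theta)
    rotated[1::2] = x_even * np.sin(theta) + x_odd * np.cos(theta)
    return rotated

q = np.random.default_rng(0).normal(size=8)     # a toy query vector
print(rope(q, pos=3))                           # same vector, rotated according to position 3
```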
Recent advances allow RoPE to handle even longer contexts:
Used in: LLaMA 2/3, Gemma, Mistral, Code-LLaMA, PaLM, GPT-NeoX, and most modern open-source LLMs
What is ALiBi?
ALiBi (Attention with Linear Biases) is a simpler approach that doesn't add positional embeddings at all. Instead, it directly modifies the attention scores by adding a penalty proportional to the distance between tokens.
When computing attention between the token at position i and the token at position j, ALiBi adds a negative penalty based on their distance |i - j| to the attention score: bias(i, j) = -m · |i - j|
where m is a slope parameter (different for each attention head)
For a sentence "The cat sat on the mat":
Result: Tokens far away get lower attention scores (larger penalty), while nearby tokens get higher attention.
Both methods achieve similar performance and far exceed older approaches like sinusoidal encoding.
Used in: MPT (MosaicML), Falcon (TII), JINA embeddings, and various research models
In real-world applications, positional encodings enable understanding of:
The AI understands this is a sequence of events in time order.
"that I ordered last week" modifies "product" - position helps track these relationships.
The word "Not" at the beginning flips the sentiment of the entire phrase.
Word order distinguishes questions from statements - critical for routing!
Bottom Line: Positional encodings are the secret ingredient that allows transformers to understand that language is not just a bag of words - it's a sequence where order carries meaning.
Positional encodings capture word order information in transformer models.
Without position information, transformers cannot distinguish "dog bites man" from "man bites dog."
Most modern language models use RoPE (Rotary Position Embedding) for improved long-context understanding.
We've seen what embeddings are and how they're used. But where do these magical numbers come from? How does the model learn that "cat" and "dog" should have similar embeddings?
Words that appear in similar contexts tend to have similar meanings. This is called the distributional hypothesis, and it's the foundation of embedding learning.
"cat" and "dog" appear with similar surrounding words: "The ___ is sleeping", "The ___ chased". The model learns to give them similar embeddings because they appear in similar contexts.
Understanding the principle is one thing — but how does the model actually adjust those 768 numbers? Here's the core mechanism.
Before training, each word gets a random vector of numbers. "cat" = [0.42, -0.13, 0.88, ...], "dog" = [-0.07, 0.91, -0.24, ...]. These numbers mean nothing yet — they're just starting points.
Neural networks are fundamentally vector transformation machines. They take input vectors, transform them through multiple layers, and produce output vectors.
Each layer multiplies the vector by a weight matrix, producing a new vector. The final vector represents the model's prediction. For next-token prediction, this output vector contains probabilities for each possible next word.
After training on billions of sentences, words that appear in similar contexts (like "cat" and "dog") naturally end up with similar embedding vectors — not because we told the model they're similar, but because the training process pushed them together to minimize prediction errors.
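Here's a toy sketch of that pushing-together effect; the shared context vector and the update rule are deliberately simplified stand-ins for real next-token training:

```python
import numpy as np

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=4) for w in ["cat", "dog"]}   # random starting vectors

# Stand-in vector for a shared context like "The ___ is sleeping"
context = rng.normal(size=4)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print("before:", round(cosine(emb["cat"], emb["dog"]), 2))

# Toy objective: each time a word appears in this context, nudge its embedding
# toward the context vector (a gradient-descent-style step)
for _ in range(50):
    for word in ("cat", "dog"):
        emb[word] += 0.1 * (context - emb[word])

print("after:", round(cosine(emb["cat"], emb["dog"]), 2))   # close to 1.0
```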
The mechanism above works for any prediction task. Different training approaches just change what the model is asked to predict:
Want to understand training in depth?
→ Chapter 5: Fine-tuning & Model Adaptation covers:
See embeddings in action through a practical semantic search example, then explore how production AI systems like RAG and AI agents use this technology.
You've seen how embeddings work in production—semantic search, RAG, AI agents. Now see what's improving behind the scenes: the same math you learned (tokenization, vector spaces, cosine similarity), just scaled to solve global business problems.
Understanding these challenges helps you recognize when existing solutions might struggle—and what emerging technologies might solve next.
BPE and WordPiece tokenizers split Hindi/Bengali text into 2-5× more tokens than English for the same content. This means Indian companies pay up to 5× more for API calls—same question, higher cost. A customer support system serving 10M users in India pays 3-5× more for identical semantic retrieval compared to English markets.
What's being built: Researchers are developing Universal Tokenizers with balanced vocabularies (250K+ tokens) that aim to reduce this cost inequality across languages.
Google DeepMind discovered that single-vector embeddings start losing accuracy after a certain document count: 512-dim fails around 500K documents, 1024-dim around 4M, 4096-dim around 250M. This isn't a bug—it's a mathematical ceiling.
What's being built: Multi-vector retrieval systems like ColBERT use multiple 128-dim vectors per document instead of one 1024-dim vector, preserving finer semantic detail at scale.
Traditional APIs return one static vector per chunk. When long documents split into 512-token chunks, critical context spanning boundaries gets lost. Example: "The company announced layoffs in Q3, but profits rose in Q4" might split across chunks, losing the causal relationship between events.
What's being built: Late Chunking techniques embed all tokens first (up to 32K), then chunk afterward—preserving cross-boundary context. Voyage-context-3 achieves 14% better retrieval accuracy using this approach, particularly beneficial for long legal and financial documents.
Modern AI needs to search across images, videos, audio, and text jointly. How do you create a vector space where "a photo of a cat" and the text "cat" cluster together? Models like jina-clip-v2 train text and image encoders together, creating a shared 512-dim space—same cosine similarity math, now spanning modalities.
What's being built: For edge devices (Raspberry Pi: 8GB RAM), Matryoshka Embeddings generate multiple resolutions (1024 → 128 dims) from one model—use lower dimensions on IoT, full precision in cloud. This matters for on-device search without cloud latency.
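Here's a minimal sketch of how a Matryoshka-style embedding is typically consumed: keep the first N dimensions and re-normalize. This assumes the model was trained with the Matryoshka objective; the vector below is a random stand-in:

```python
import numpy as np

full = np.random.default_rng(0).normal(size=1024)   # stand-in for a Matryoshka embedding

def truncate(vec, dims):
    """Keep only the first `dims` dimensions and re-normalize to unit length."""
    small = vec[:dims]
    return small / np.linalg.norm(small)

cloud_vector = truncate(full, 1024)   # full precision in the cloud
edge_vector = truncate(full, 128)     # 8x smaller vector for an IoT device
print(cloud_vector.shape, edge_vector.shape)        # (1024,) (128,)
```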
Embeddings convert text to vectors and measure similarity—but that's still just linear math. To recognize patterns, make predictions, and truly "learn," AI needs something more.
Learn why stacking layers matters and how activation functions transform simple calculations into intelligent systems.
Ready to test your understanding? Answer all questions correctly to unlock your achievement!