Deep Learning & Modern AI: Chapter 1

Embeddings: Words to Numbers

How AI captures the meaning of language


The Translation Question

We've learned how neural networks process data using matrices and matrix multiplication. But here's the key insight: neural networks only understand numbers, while human language is made of words, sentences, and meaning.

Human Language

cat
democracy
happiness
artificial intelligence

Words carry meaning, context, relationships


How do we translate?

Neural Network Input

[0.234, -0.512, 0.891, 0.123, …]
[0.912, 0.445, -0.223, 0.667, …]
[0.556, -0.334, 0.789, -0.445, …]
[-0.111, 0.634, 0.478, -0.889, …]

Numbers that can be multiplied and transformed

The Central Question

How do we convert the word "cat" into a list of numbers that captures its meaning in a way that neural networks can understand? This is what embeddings solve.

What You'll Learn in This Chapter

We're going deep into embeddings - understanding not just what they are, but how they work, why they work, and how modern AI systems use them. This chapter covers:

1. Vector Spaces: How can geometry represent meaning?
2. Measuring Similarity: The mathematics of semantic closeness
3. Tokenization: Breaking text into processable pieces
4. The Embedding Matrix: Where the magic numbers live
5. Positional Encodings: Teaching models about word order
6. Training & Applications: How embeddings learn and what they enable

Vector Spaces: A Map of Meaning

Building on What You Already Know

Remember vector spaces from Chapter 5? We represented customers as vectors in 2D space:

Customer A = [6 months, 40 hours]
Customer B = [3 months, 10 hours]

Each dimension represented a measurable feature (subscription length, usage). We used vector math to find patterns and make predictions.

The Big Leap with Embeddings

Embeddings use the exact same mathematical principles, but instead of representing customer features, they represent word meanings. Instead of 2 dimensions (months, hours), we have 768+ dimensions that capture semantic relationships.

Let's see how this works...

From Customer Features to Word Meanings

Picture a map where every word has a specific location, and:

  • Words with similar meanings are placed close together
  • Words with different meanings are far apart
  • The distance and direction between words captures their relationship
Picture two regions on this map: an Animals cluster (cat, dog, kitten, puppy) and a Politics cluster (democracy, vote, election). "Cat" and "dog" sit close together; "cat" and "democracy" are far apart.

This "meaning map" is exactly what a vector space is. But what actually makes it a "vector space"? Let's understand the mathematical properties that make this possible.

The Same Vector Space Rules Apply

You already learned the two fundamental rules of vector spaces in Chapter 5:

1. Vector Addition (Closure under addition)
Chapter 5 (Customer Features): [6, 40] + [3, 10] = [9, 50]
Chapter 1 (Word Meanings): embedding("king") + embedding("woman")

2. Scalar Multiplication (Closure under scaling)
Chapter 5 (Customer Features): 2 × [6, 40] = [12, 80]
Chapter 1 (Word Meanings): 0.5 × embedding("intense")

What's Different with Embeddings?

The mathematical rules are identical. What changes is what the vectors represent:

  • Customer vectors: Each dimension = one explicit feature (months, hours, logins)
  • Word embeddings: Each dimension = one learned semantic trait (formality, sentiment, abstractness)

Why This Matters for Language

Because embeddings live in a vector space with these same mathematical properties, AI models can:

  • Measure semantic similarity between words (using dot product and cosine similarity you learned in Chapter 5)
  • Combine word meanings algebraically: "king" - "man" + "woman" ≈ "queen" (see the sketch after this list)
  • Learn patterns through gradient descent (just like classification models)
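
Here is a tiny NumPy sketch of that arithmetic. The 3-dimensional vectors are invented for illustration; real embeddings have hundreds of learned dimensions:

import numpy as np

# Toy 3-dimensional "embeddings" with made-up values.
king  = np.array([0.80, 0.65, 0.10])
man   = np.array([0.75, 0.20, 0.05])
woman = np.array([0.70, 0.20, 0.85])
queen = np.array([0.78, 0.63, 0.88])

# Vector arithmetic: king - man + woman should land near queen.
result = king - man + woman

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(result)                 # ≈ [0.75, 0.65, 0.90]
print(cosine(result, queen))  # ≈ 0.999 for these toy numbers: very close to "queen"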

From 2D to High-Dimensional Spaces

Our map visualization is 2-dimensional (left-right, up-down). But to capture the full richness of language, we need many more dimensions.

2D: 2 numbers

Can capture some relationships, but very limited

3D: 3 numbers

Better, but still insufficient for language

"""""
"""""
"""""
High-D: Hundreds to Thousands of numbers

What modern AI models actually use

Real Embedding Dimensions Across Models (2025)

Quick context: Embeddings are vectors (lists of numbers) that represent word meanings in AI models. The "dimension" is how many numbers are in each vector — more numbers means more capacity to capture subtle meanings.

Different AI models use different embedding dimensions. Larger dimensions can capture more nuanced meanings, but require more computational resources. Here's what major models use:

  • Cohere embed-v3 (2023): 1,024 dimensions
  • Voyage AI voyage-2 (2024): 1,024 dimensions
  • OpenAI text-embedding-3-small (2024): 1,536 dimensions
  • OpenAI text-embedding-3-large (2024): 3,072 dimensions
  • Google Gemini Embeddings (2024): 3,072 dimensions
Key Insights
✓ Smaller models (Cohere, Voyage): 1,024 dimensions — efficient and fast
✓ Mid-size models (OpenAI small): 1,536 dimensions — balanced performance
✓ Large models (OpenAI large, Gemini): 3,072 dimensions — capture subtle nuances

More dimensions = more capacity to capture subtle semantic nuances, but also more computational cost

📝 Note About Examples

Throughout this chapter, our examples use 768 or 1,024 dimensions, sizes that are small enough to reason about while still being realistic. Just remember that production embedding models typically use anywhere from 1,024 to 3,072 dimensions depending on their use case!

Organizing a Music Library

Imagine organizing thousands of songs so similar ones are easy to find. You need to convert each song into numbers that capture its characteristics. Here's the challenge:

Understanding Multi-Dimensional Characteristics

Think about these three songs:

Fast Rock: High tempo, Rock genre, Electric guitars
Slow Rock: Low tempo, Rock genre, Acoustic guitars
Slow Classical: Low tempo, Classical genre, Orchestra

❓ Question: Can you represent each song with ONE number that captures similarity?

Why ONE Number Doesn't Work

Let's try placing each song on a number line (0.0 to 1.0):

Slow Classical: 0.2 (low tempo)
Slow Rock: ??? (low tempo like Classical, but rock genre like Fast Rock — where does it go?)
Fast Rock: 0.8 (high tempo)

⚠️ The Fundamental Limitation

Slow Rock is similar to Slow Classical in tempo (both slow),
BUT similar to Fast Rock in genre (both rock).

One number forces you to choose: sort by tempo OR genre, not both!

Songs vary across MANY independent dimensions: tempo, genre, mood, instruments, vocals, era, energy...
One number can't capture multiple independent characteristics simultaneously!

Use MULTIPLE Numbers (A Vector)

Instead of one number, represent each song as a vector — a list of numbers where each dimension captures a different characteristic:

Fast Rock Song: [0.89, 0.12, 0.78, 0.34, ... 764 more]
0.89 = High rock-ness
0.12 = Low classical-ness
0.78 = High tempo/energy
0.34 = Medium vocal intensity
• ... (768 dimensions total, each capturing different aspects)
Slow Rock Song: [0.91, 0.09, 0.32, 0.67, ... 764 more]
0.91 = High rock-ness ✓ (similar!)
0.09 = Low classical-ness ✓ (similar!)
0.32 = LOW tempo/energy ✗ (different!)
0.67 = High vocal intensity ✗ (different!)

Now the computer can calculate similarity! These two songs have similar values for dimensions 1 & 2 (both rock), but different values for dimensions 3 & 4 (tempo/vocals). The computer computes the mathematical distance between the full 768-dimensional vectors.

"Rock Ballad" (Slow, Guitar-heavy) [0.234, -0.512, 0.891, ...] 768 characteristics "Soft Rock" (Slow, Guitar-heavy) [0.241, -0.505, 0.887, ...] 768 characteristics "EDM Dance" (Fast, Electronic) [-0.678, 0.912, 0.234, ...] 768 characteristics Let's zoom into just the FIRST characteristic (Tempo: Slow → Fast): -1.0 Slow 0.0 Medium +1.0 Fast Rock Ballad 0.234 Soft Rock 0.241 Nearly identical! EDM Dance -0.678 Far apart ✓ Similar songs have similar numbers across all 768 characteristics!

The numbers directly represent musical properties. Both rock ballads have similar tempo values (0.234 and 0.241) — they're practically identical! The EDM track (-0.678) is clearly different. Now when you search for similar songs, the computer can use these numbers to find matches. This is exactly how embeddings work for words!
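
To make "computes the mathematical distance" concrete, here is a tiny NumPy sketch using just the first three of the 768 characteristics above (the values are the toy numbers from this example, not real model output):

import numpy as np

rock_ballad = np.array([0.234, -0.512, 0.891])
soft_rock   = np.array([0.241, -0.505, 0.887])
edm_dance   = np.array([-0.678, 0.912, 0.234])

# Euclidean distance: a small distance means similar songs.
print(np.linalg.norm(rock_ballad - soft_rock))  # ≈ 0.01  (very similar)
print(np.linalg.norm(rock_ballad - edm_dance))  # ≈ 1.81  (very different)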

Why Continuous Numbers?

You've seen that embeddings are vectors: [0.234, -0.512, 0.891]. But why use decimal numbers? Why not simpler integers like [1, 2, 3]?

After all, [1, 2, 3] is cleaner, easier to store, and faster to process. What's wrong with discrete values?

Understanding Gradient-Based Learning: Why Discrete Numbers Don't Work

Imagine you're training a model. It predicts "cat" when the correct answer is "dog". The model needs to adjust the embedding for "cat" to make it closer to "dog".

With discrete numbers [1, 2, 3]:

Current embedding: [1, 2, 3]
Model asks: "Should I change 1 to 2? Or 2 to 3?"
Constraint: You can only jump. No in-between values.

The learning algorithm calculates a gradient — a direction that says "move this number up by 0.03" or "move that number down by 0.17". But with discrete numbers, you can't move by 0.03. You can only jump by 1.

The Limitation: The model can't make small corrections. Every change is a big leap. Learning either overshoots or gets stuck. Gradients require smooth, continuous values to function effectively.
How Continuous Numbers Enable Precise Adjustments

With continuous numbers, the model can move in tiny steps:

With continuous numbers [0.234, -0.512, 0.891]:

Iteration 1: [0.234, -0.512, 0.891]
→ Adjust by gradient: +0.003, -0.017, +0.005
Iteration 2: [0.237, -0.529, 0.896]
→ Adjust by gradient: +0.002, -0.011, +0.003
Iteration 3: [0.239, -0.540, 0.899]
→ Continue tiny adjustments...

Each adjustment is small and controlled. The model follows the gradient like a ball rolling downhill, gradually finding better representations. This is how neural networks learn — through millions of tiny gradient-based updates.

Analogy: Discrete numbers = climbing stairs (only specific heights). Continuous numbers = walking on a ramp (any height). Gradient descent needs the ramp.
The Mathematical Reason

Neural network training uses calculus. Specifically, it computes derivatives (rates of change). Derivatives only exist for continuous functions.

The gradient formula asks: "If I change x by an infinitesimal amount, how much does the error change?"

With discrete x: There is no infinitesimal change. Derivative is undefined or zero everywhere.
With continuous x: You can compute exact slopes. Gradient descent works.

This is why all learnable parameters in neural networks (weights, biases, embeddings) use floating-point numbers. Learning is optimization through gradients, and gradients require continuity.

Continuous numbers aren't a choice — they're a requirement for learning.
Without smooth values, there are no gradients. Without gradients, there is no training.
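
Here is a minimal Python sketch of that idea: one toy embedding is nudged toward a target by repeated gradient steps (the vectors, loss, and learning rate are invented for illustration):

import numpy as np

# Nudge the "cat" embedding toward the "dog" embedding by gradient descent
# on a squared-distance loss.
cat = np.array([0.234, -0.512, 0.891])
dog = np.array([0.267, -0.489, 0.823])
learning_rate = 0.1

for step in range(3):
    loss = np.sum((cat - dog) ** 2)       # how far apart the vectors are
    gradient = 2 * (cat - dog)            # derivative of the loss w.r.t. cat
    cat = cat - learning_rate * gradient  # a small, continuous adjustment
    print(step, round(loss, 6), cat)

# Each update moves "cat" slightly closer to "dog" - exactly the kind of tiny,
# continuous correction that integer-only embeddings could never make.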

How AI Actually Learns Embeddings

You don't manually assign these numbers. Instead, AI trains on millions of examples (e.g., "users who liked Song A also liked Song B") and automatically learns what numbers to assign to make similar items end up close together in vector space.

About the Music Example:

This example uses interpretable dimensions ("tempo", "guitar-ness") for teaching purposes. Real embeddings have latent features learned by AI — you can't point to dimension #347 and say "this represents pluralization." The patterns are complex and distributed across many dimensions. The music analogy helps understand the concept, but actual embeddings are learned representations, not hand-crafted features.

Because these are numbers, not arbitrary labels, the computer can use mathematical formulas to calculate how similar two items are. Looking at [0.234, -0.512, 0.891] and [0.241, -0.505, 0.887], the computer can measure: "These numbers are almost the same, so these songs must be similar!"

We'll explore the exact mathematical formulas for measuring similarity in the next section!

Why Negative Numbers Are Essential

You might wonder: why do we need negative numbers? Why not just use 0 to 1? The answer: direction matters for capturing opposite meanings.

Imagine a "Temperature" Dimension:

Picture a temperature dimension running from -1.0 (COLD) through 0.0 (NEUTRAL) to +1.0 (HOT): frozen (-0.82), cold (-0.52), cool (-0.23), room (0.03), warm (0.28), hot (0.67), scorching (0.91). The negative and positive directions capture opposite meanings, which is why embeddings use values from -1 to +1, not just 0 to 1.

The Three Zones:

  • Negative values (-1.0 to 0.0): Represent one side of a spectrum (cold, negative, past, etc.)
  • Zero (0.0): Neutral - this dimension doesn't apply to this word
  • Positive values (0.0 to +1.0): Represent the opposite side (hot, positive, future, etc.)

Real example: Across 768 dimensions, each dimension might capture a different semantic aspect: temperature, emotion, time, size, formality, concreteness, etc. Using negative and positive values lets the model represent rich, nuanced relationships.

Technical Note: Value Ranges

The -1 to +1 scale shown here is for illustration. In practice, raw embedding values are unbounded — they can be any positive or negative number. However, many models apply normalization (like L2 normalization) which scales the entire vector so its length equals 1, bringing individual values roughly into the -1 to +1 range. This normalization helps with:

  • Faster similarity computations (cosine similarity becomes a simple dot product)
  • Preventing some dimensions from dominating others
  • Better numerical stability during training

So while we use -1 to +1 for teaching, real embeddings before normalization can have values like -47.3 or +12.8!
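
As a concrete illustration, here is a small NumPy sketch of L2 normalization (the raw values are invented):

import numpy as np

# Raw (unnormalized) embedding values can be any size.
raw = np.array([-47.3, 12.8, 3.1, -0.4])

# L2 normalization: divide by the vector's length so the result has length 1.
unit = raw / np.linalg.norm(raw)
print(np.linalg.norm(unit))   # 1.0
print(unit)                   # individual values now fall roughly within [-1, 1]

# For unit-length vectors, cosine similarity reduces to a plain dot product.
other = np.array([0.1, -0.2, 0.9, 0.4])
other_unit = other / np.linalg.norm(other)
print(unit @ other_unit)      # equals the cosine similarity of raw and other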

What Does an Embedding Actually Look Like?

Now that we understand continuous vectors, let's look at real embeddings. An embedding is simply a list of continuous numbers - a vector:

Word: "cat" 768 dimensions
[
0.234, -0.512, 0.891, 0.123, -0.334, 0.667, -0.111, 0.445, … (760 more)
]

Each number typically between -1 and 1

Word: "dog" 768 dimensions
[
0.267, -0.489, 0.823, 0.156, -0.301, 0.634, -0.145, 0.478, … (760 more)
]

Notice: Similar numbers! Similar meanings → similar vectors

Word: "democracy" 768 dimensions
[
-0.678, 0.912, 0.234, -0.445, 0.789, -0.223, 0.556, -0.889, … (760 more)
]

Completely different numbers = different meaning

Picture a simple 2D plot of word vectors, with one dimension capturing something like animality and the other something like size, and words such as cat, dog, kitten, democracy, and vote placed on it as points.
Observation:

Similar words (cat, dog, kitten) cluster together!

Different concepts (animals vs politics) stay far apart.

This is how AI understands meaning through geometry.

Measuring Similarity: The Mathematics

How Do We Measure "Close"?

We said "cat" and "dog" are "close" in vector space. But what does that mean mathematically? We need a precise way to measure similarity between vectors.

A Geometry Analogy

Imagine two people standing in a room. How do we measure if their positions are "similar"?

Option 1: Distance

How far apart are they? (Euclidean distance)

Option 2: Direction

Are they in the same direction from the origin? (Angle/Cosine)

Both work, but for embeddings we typically care about direction more than absolute distance.

Cosine Similarity: The Standard Metric

The most common way to measure similarity is cosine similarity. It measures the angle between two vectors.

Small angle = Similar words

Large angle = Different words

What You Need to Know

Cosine similarity gives you a score from -1 to 1:
1 = Vectors point same direction (very similar meanings)
0 = Vectors perpendicular (unrelated)
-1 = Vectors point opposite directions (opposite meanings)

The math handles the complexity — you just need to know higher scores = more similar!

🧮 Deep Dive (Optional): See the Math Formula

The Cosine Similarity Formula

cos(θ) = (A ⋅ B) / (||A|| ⋅ ||B||)

A ⋅ B
Dot Product: Multiply corresponding elements and sum
= a₁b₁ + a₂b₂ + … + aₙbₙ

||A||
Magnitude of A: Length of vector A
= √(a₁² + a₂² + … + aₙ²)

||B||
Magnitude of B: Length of vector B
= √(b₁² + b₂² + … + bₙ²)

Result range: -1 to 1

  • 1 = Identical direction (very similar)
  • 0 = Perpendicular (unrelated)
  • -1 = Opposite direction (opposite meaning)

Worked Example: Cosine Similarity Step by Step

Vector A = [0.8, 0.4, 0.6]
Vector B = [0.7, 0.5, 0.5]

Step 1: Dot Product (A ⋅ B)
(0.8 × 0.7) + (0.4 × 0.5) + (0.6 × 0.5) = 0.56 + 0.20 + 0.30 = 1.06

Step 2: Magnitude of A (||A||)
√(0.8² + 0.4² + 0.6²) = √(0.64 + 0.16 + 0.36) = √1.16 = 1.077

Step 3: Magnitude of B (||B||)
√(0.7² + 0.5² + 0.5²) = √(0.49 + 0.25 + 0.25) = √0.99 = 0.995

Final: Cosine Similarity
cos(θ) = 1.06 / (1.077 × 0.995) = 1.06 / 1.072 = 0.989

On the similarity scale from -1 (opposite) through 0 (unrelated) to 1 (identical), a score of 0.989 means these two vectors point in nearly the same direction: very similar.
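
The same calculation takes only a few lines of Python. This sketch uses NumPy and the toy vectors from the worked example above:

import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

A = np.array([0.8, 0.4, 0.6])
B = np.array([0.7, 0.5, 0.5])
print(cosine_similarity(A, B))   # ≈ 0.989: very similar directions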

Key Takeaway

Cosine similarity compares vector directions to measure semantic similarity.
When words have similar meanings, their embeddings point in similar directions, producing cosine scores close to 1.

Common Applications
Semantic Search: Find documents by meaning, not keywords. Customer asks "How do I get my money back?" — system finds "Refund Policy" even though words don't match.
RAG Retrieval: LLMs use cosine similarity to fetch relevant context from your knowledge base before generating answers.
AI Agent Tool Selection: When users request an action, agents use cosine similarity to pick the right tool from hundreds of options.

Cosine similarity is widely used in semantic search, retrieval systems, and recommendation engines across the AI industry.

From Text to Vectors: The Complete Pipeline

The Big Picture

From Text to Numbers: The Complete Journey

As we explored in Chapter 1, AI finds patterns in data through numerical processing. To leverage this capability for language, we transform text into vectors through a series of steps:

Text: "playing" → Tokens: ["play", "ing"] → IDs: [42, 87] → Vectors: [[0.23, ...], [0.15, ...]]

Let's understand each step of this pipeline, starting with: How do you break text into pieces?

Step 1: Tokenization — Breaking Text Into Pieces

The Test Sentence

Let's use this sentence throughout to see how different approaches work:

"The players are replaying the game"

Three Approaches

Attempt 1
Make the Dictionary Like a Keyboard

The Idea: Just like a keyboard has ~100 keys (letters, numbers, punctuation) but you can type anything, what if our dictionary just stored individual characters? Then we could "spell out" any word!

Dictionary = { 'a', 'b', 'c', ..., 'z', 'A', 'B', ..., 'Z', ' ', '.', ... } → ~100 keys ⌨️
Sentence gets "typed" letter-by-letter:
['T', 'h', 'e', ' ', 'p', 'l', 'a', 'y', 'e', 'r', 's', ' ', 'a', 'r', 'e', ' ', 'r', 'e', 'p', 'l', 'a', 'y', 'i', 'n', 'g', ' ', 't', 'h', 'e', ' ', 'g', 'a', 'm', 'e']
→ 34 keystrokes!
Pro: Small dictionary (~100 keys), can spell anything
Con: Way too many "keystrokes", individual letters carry no meaning
Attempt 2
Store Every Word in the Language

The Idea: Instead of individual letters, what if we stored every complete word? "play", "playing", "player", "players", "replay", "replaying" — each gets its own spot in the dictionary.

Dictionary = { 'The', 'players', 'are', 'replaying', 'the', 'game', ..., 'play', 'playing', 'player', ... }
→ Need ~500,000 entries to cover English!
Our sentence: ['The', 'players', 'are', 'replaying', 'the', 'game']
→ Just 6 tokens! Much shorter.
Pro: Very short, preserves word meaning perfectly
Con: Dictionary must contain EVERY word in the language. Can't handle new words (like "ChatGPT")

Analogy: Like a keyboard with 500,000 keys — one for every word. Fast, but impractical and can't type new words!

Wait... What if There's a Better Way?

Attempt 1 was too small. Attempt 2 was too big. Let's look closely at some words and see if we notice anything interesting...

Let's Break Down Some Related Words
Word: playing
play + ing
Word: player
play + er
Word: replaying
re + play + ing
Word: players
play + er + s

Do you see it? The piece play keeps showing up!
Same with ing, er, re...

Instead of storing 500K whole words, what if we just stored these common pieces?

Attempt 3 ✓
Store Common Pieces (Subwords)
Dictionary = { 'The', 'play', 'er', 's', 'are', 're', 'ing', 'the', 'game', ... } → a full subword vocabulary needs only ~50K entries
Sentence becomes: ['The', 'play', 'er', 's', 'are', 're', 'play', 'ing', 'the', 'game']
→ 10 tokens
Perfect Balance: Medium dictionary, reasonable token count, handles "play", "player", "players", "playing", "replay", "replaying" — ALL with the same pieces!

This is Byte-Pair Encoding (BPE)

BPE automatically discovers these frequent patterns by analyzing millions of sentences.
It's the same pattern recognition from Chapter 1 — but applied to finding reusable text pieces.

How BPE Actually Works

BPE automatically discovers these common pieces by analyzing text and repeatedly merging the most frequent adjacent character pairs.

The Algorithm (Simplified)
Start: p-l-a-y, p-l-a-y-i-n-g, p-l-a-y-e-d, p-l-a-y-e-r
Step 1: Find most frequent pair (p, l) → merge into pl
Step 2: Find most frequent pair (pl, a) → merge into pla
Step 3: Find most frequent pair (pla, y) → merge into play
...repeat until vocabulary reaches desired size

Result: A vocabulary of frequently-used pieces that can combine to form any word, including ones never seen during training.

Who decides "desired size"? The model designers choose vocabulary size during training based on tradeoffs: smaller vocabularies (32K-50K tokens) are faster but produce longer sequences, while larger ones (100K-256K tokens) are slower but handle multilingual text more efficiently. Most modern LLMs use 32K-50K for English-focused tasks or 100K-256K for multilingual support.

For the Curious: Complete Mathematical Formulation
1. Initialization

Start with a training corpus C and an initial vocabulary V₀ containing all unique characters:

V₀ = { c | c ∈ unique characters in C }

Example:
C = ["low", "lower", "newest", "widest"]
V₀ = { 'l', 'o', 'w', 'e', 'r', 'n', 's', 't', 'd', 'i' }

Each word is split into individual characters: "low" → ['l', 'o', 'w']

2. Frequency Counting

For each iteration i, count all adjacent symbol pairs in the corpus:

freq(x, y) = Σ count of adjacent pair (x, y) across all words in C

Example (iteration 1):
Words: ['l','o','w'], ['l','o','w','e','r'], ['n','e','w','e','s','t'], ['w','i','d','e','s','t']

Pair frequencies:
freq('l', 'o') = 2 ← appears in "low" and "lower"
freq('o', 'w') = 2 ← appears in "low" and "lower"
freq('w', 'e') = 2 ← appears in "lower" and "newest"
freq('e', 's') = 2 ← appears in "newest" and "widest"
freq('s', 't') = 2 ← appears in "newest" and "widest"
freq('w', 'i') = 1
freq('i', 'd') = 1
freq('d', 'e') = 1
...
3. Merge Operation

Select the most frequent pair and merge it into a new symbol:

(x*, y*) = argmax(x,y) freq(x, y)

new_symbol = x* + y* (concatenation)
Vi+1 = Vi ∪ { new_symbol }

Replace all occurrences of (x*, y*) in corpus with new_symbol

Example:
Most frequent: ('e', 's') with frequency 2
Create new symbol: 'es'
V₁ = V₀ ∪ { 'es' }

Updated words:
['n','e','w','e','s','t'] → ['n','e','w','es','t']
['w','i','d','e','s','t'] → ['w','i','d','es','t']
4. Iteration

Repeat steps 2-3 until vocabulary reaches desired size |V| = k:

for i = 0 to k - |V₀| do:
  1. Count all adjacent pairs → freq(x, y) for all x, y
  2. Find most frequent: (x*, y*) = argmax freq(x, y)
  3. Create new symbol: s = x* + y*
  4. Add to vocabulary: Vi+1 = Vi ∪ { s }
  5. Replace all (x*, y*) with s in corpus
end for

Stopping condition:
|Vfinal| = k (typically k = 32,000 to 128,000)
5. Tokenization (Encoding New Text)

To tokenize a new word, apply learned merges in the same order:

Input: word w, learned merge operations M = [(x₁, y₁), (x₂, y₂), ..., (x_k, y_k)]

1. Split word into characters: w = [c₁, c₂, ..., c_n]
2. For each merge (x_i, y_i) in M (in order):
   Replace all adjacent (x_i, y_i) with (x_i + y_i)
3. Return final sequence of tokens

Example:
Word: "lowest"
Start: ['l', 'o', 'w', 'e', 's', 't']
Apply merge₁ ('e', 's') → 'es': ['l', 'o', 'w', 'es', 't']
Apply merge₂ ('es', 't') → 'est': ['l', 'o', 'w', 'est']
Apply merge₃ ('l', 'o') → 'lo': ['lo', 'w', 'est']
Apply merge₄ ('lo', 'w') → 'low': ['low', 'est']

Final tokens: ['low', 'est']
6. Computational Complexity
Training:
- Per iteration: O(N) to count pairs and apply merges
  where N = total number of symbols in corpus
- Total iterations: k - |V₀| (usually ~50K merges)
- Overall: O(k × N)

Tokenization (inference):
- Worst case: O(n²) where n = word length
- Typical: O(n × log k) with optimized data structures
Key Mathematical Properties
  • Greedy algorithm: Always merges most frequent pair (no backtracking)
  • Deterministic: Same corpus + same k → same vocabulary
  • Order-dependent: Merge sequence matters for tokenization
  • Lossless: Can always reconstruct original text from tokens
  • Vocabulary size control: Exact control via stopping at k merges
  • OOV handling: No "unknown" tokens (can fall back to characters)
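
To make the algorithm concrete, here is a minimal Python sketch of the training and encoding procedure described above. It is a toy implementation for the four-word corpus, not production tokenizer code:

from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def train_bpe(words, num_merges):
    """Learn a list of BPE merges from a tiny corpus."""
    corpus = [list(w) for w in words]          # each word as a list of characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()                      # 1. count adjacent symbol pairs
        for symbols in corpus:
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)       # 2. pick the most frequent pair
        merges.append(best)
        corpus = [merge_pair(s, best) for s in corpus]   # 3. apply the merge everywhere
    return merges

def tokenize(word, merges):
    """Apply the learned merges, in order, to a new word."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

merges = train_bpe(["low", "lower", "newest", "widest"], num_merges=4)
print(tokenize("lowest", merges))   # ['low', 'est'] (exact merges depend on tie-breaking)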
Real Models Today
  • 50K-100K: typical vocabulary size
  • ~2-3: average pieces per word (English)
  • Billions: possible word combinations

Modern LLMs like GPT-4, Claude, and others use vocabularies in this range — small enough to be efficient, large enough to be expressive.

Want to explore tokenizers yourself?

OpenAI's tiktoken — GPT models' BPE tokenizer (Python library; see the example after this list)
Tiktokenizer — Interactive web tool to see how text gets tokenized
SentencePiece — Google's language-agnostic tokenizer library
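
For example, here is roughly what exploring a tokenizer with tiktoken looks like (assuming the library is installed with pip install tiktoken; the exact splits and IDs depend on which encoding you load):

import tiktoken

# Load the BPE vocabulary used by recent OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("The players are replaying the game")
print(ids)                               # a list of integer token IDs
print([enc.decode([i]) for i in ids])    # the text piece behind each ID
print(enc.decode(ids))                   # round-trips back to the original sentence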

Step 2: From Tokens to Vectors

Now that we have tokens ["play", "ing"], we need to convert them to numbers.

How Tokens Get Their Numbers

When the tokenizer is trained using BPE (as described above), each discovered token is assigned a sequential integer ID as it's added to the vocabulary:

During Training:
1. Start with individual characters → assign IDs 0-255
2. BPE finds "pl" is frequent → assign ID 256
3. BPE finds "ay" is frequent → assign ID 257
4. BPE merges to find "play" → assign ID 258
5. Continue until vocabulary reaches desired size

The ID number simply reflects when that token was added to the vocabulary during training. "play" might get ID 258 if it was the 259th token discovered (starting from 0).

Final Vocabulary:
"p" → 112
"l" → 108
"a" → 97
"y" → 121
"pl" → 256
"play" → 258
"ing" → 412
(32,000 tokens total)

Once trained, this vocabulary is fixed. When you type "play", the tokenizer looks it up and finds ID 258.

Stage 2A: ID Mapping

Each token gets a unique integer ID from the vocabulary:

"play" → ID: 42
"ing" → ID: 87
Stage 2B: Embeddings

As we saw with the music library analogy, a single number cannot capture semantic similarity. Instead, each token receives a vector of numbers (typically 256-4096 dimensions), where similar words naturally receive similar vectors.

"cat" → [0.82, -0.31, 0.67, ... 0.15] ← Animal, domestic, pet
"dog" → [0.79, -0.28, 0.71, ... 0.19] ← Similar!
"king" → [-0.15, 0.91, -0.42, ... 0.88] ← Very different!
How Embeddings Learn Meaning

Embeddings learn through the same pattern recognition process covered in Chapter 1 — the model adjusts embedding weights to minimize errors using gradient descent and loss functions (introduced in Chapter 2).

The Distributional Hypothesis: Words that appear in similar contexts have similar meanings.

If the model sees:
• "The cat sat on the mat"
• "The dog sat on the mat"

The training process pushes the embeddings for "cat" and "dog" closer together because they're used in similar ways. After seeing millions of sentences, embeddings naturally cluster by meaning.
The Learning Loop:
1. Model makes a prediction using current embeddings
2. Calculate loss — measure how wrong the prediction was
3. Compute gradients — which direction improves each embedding?
4. Update embeddings using gradient descent
5. Repeat millions of times across training data

Want to fine-tune pre-trained models for your specific needs? See Chapter 5: Fine-tuning & Model Adaptation.

The Embedding Matrix & How Lookup Works

The Storage Challenge

A model with 50,000 tokens, each represented by a 1,024-dimensional vector, needs to store and retrieve 51.2 million numbers. The model must access these embeddings thousands of times per second during inference.

The question: How do we organize these embeddings so the model can instantly retrieve the vector for any token ID?

The Embedding Matrix: Think of a Giant Spreadsheet

All token embeddings are stored in a single two-dimensional matrix called the embedding matrix. The simplest way to understand it: imagine a giant Excel spreadsheet.

The Spreadsheet Structure
Rows: Each row is one token (50,000 rows for a 50K vocabulary)
Columns: Each column is one dimension (1,024 columns for 1,024-dimensional embeddings)
Cells: Each cell contains one number from the embedding vector
Token ID │ Dim 0 Dim 1 Dim 2 Dim 3 ... Dim 1023
Row 42 │ 0.23 -0.87 0.45 0.34 ... 0.12 ← "play"
Row 87 │ 0.15 -0.23 0.34 -0.12 ... 0.08 ← "ing"
Row 512 │ 0.82 -0.31 0.67 0.21 ... 0.15 ← "cat"
...
Row 49999│ -0.45 0.67 -0.23 0.89 ... -0.34 ← last token

Just like finding a value in Excel: if you want the embedding for "cat" (token ID 512), jump directly to row 512 and grab all 1,024 numbers across that row.

Real-World Example: Processing "playing cats"
Step 1: Tokenizer splits text → ["play", "ing", "cat", "s"]
Step 2: Look up each token in vocabulary dictionary:
  • "play" → ID 42
  • "ing" → ID 87
  • "cat" → ID 512
  • "s" → ID 23
Step 3: Jump to those rows in the embedding matrix:
  • Row 42 → [0.23, -0.87, 0.45, ..., 0.12]
  • Row 87 → [0.15, -0.23, 0.34, ..., 0.08]
  • Row 512 → [0.82, -0.31, 0.67, ..., 0.15]
  • Row 23 → [0.11, 0.05, -0.12, ..., 0.03]
Step 4: Model processes these vectors through neural network layers
Why This Design Enables Real-Time Speed

The spreadsheet analogy reveals why LLMs are so fast. When you need the embedding for "cat" (token ID 512):

❌ Slow approach: Search through all 50,000 rows to find "cat"
✓ Fast approach: Jump directly to row 512

This is like using Excel's INDEX with a known row number instead of searching with VLOOKUP: you don't scan any rows, you jump straight to row 512. Whether the vocabulary has 50,000 tokens or 1 million tokens, accessing a row takes the same constant time.

The Key Insight:

When ChatGPT processes your 50-word message, it performs 50 row lookups in this embedding matrix. With direct row access, this happens in milliseconds. If it had to search through rows sequentially, each lookup would slow down proportionally to vocabulary size — turning instant responses into multi-second delays.

🔍 Technical Deep Dive (Optional): The Mathematics of Embedding Lookup
The Mathematical Equivalence

Theoretically, embedding lookup is a linear transformation. For vocabulary size V and embedding dimension d, we have an embedding matrix E ∈ ℝ^(V×d).

To retrieve the embedding for token with index i, we could use a one-hot vector:

One-hot approach:
xonehot = [0, 0, ..., 1, ..., 0] ∈ ℝV (where xi = 1, all others = 0)

embedding = xonehot · E
embedding ∈ ℝd = E[i, :] (row i of matrix E)

When you multiply a one-hot vector by the embedding matrix, the result is mathematically equivalent to extracting the i-th row. This is because all terms vanish except where xi = 1.
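
A quick NumPy sketch makes the equivalence easy to verify (toy sizes, random values):

import numpy as np

# Tiny embedding matrix: vocabulary of 5 tokens, 3 dimensions each.
E = np.random.rand(5, 3)
i = 2                                   # the token ID we want to look up

one_hot = np.zeros(5)
one_hot[i] = 1.0

via_matmul   = one_hot @ E              # one-hot multiplication: 5 × 3 multiplies
via_indexing = E[i]                     # direct row access: no multiplies at all

print(np.allclose(via_matmul, via_indexing))   # True: the two results are identical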

Why Not Implement It This Way?

Forward pass computational cost:

One-hot multiplication: x_onehot · E requires V × d multiplications
For V = 50,000 and d = 1,024:
51.2 million multiplications, where 51.199 million are "multiply by 0"

Memory overhead:

One-hot vector: 50,000 values (float32 = 200 KB per token)
Batch of 512 tokens with one-hot: ~102 MB
Batch of 512 token indices (int32): 2 KB
51,000× memory reduction

In production systems processing thousands of tokens per second, this memory overhead becomes prohibitive. NLP vocabularies can reach 100K-1M tokens, making one-hot encoding impractical.

The Efficient Implementation: Direct Indexing

Modern frameworks recognize this inefficiency and implement embedding lookup as direct array access:

Direct indexing:
embedding = E[i] # Direct row access, O(1) operation

# Framework implementations:
PyTorch: nn.Embedding(vocab_size, embed_dim)
TensorFlow: tf.nn.embedding_lookup(E, indices)
JAX: E[indices] # Pure array indexing

This operation has O(1) time complexity — constant time regardless of vocabulary size — and zero computational waste.
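
For instance, a minimal PyTorch sketch of this lookup (the vocabulary size, dimension, and token IDs are the illustrative values used in this chapter):

import torch
import torch.nn as nn

vocab_size, embed_dim = 50_000, 1_024
embedding = nn.Embedding(vocab_size, embed_dim)   # the embedding matrix E, 50,000 × 1,024

token_ids = torch.tensor([42, 87, 512, 23])       # "play", "ing", "cat", "s"
vectors = embedding(token_ids)                    # direct row lookups, no matrix multiply
print(vectors.shape)                              # torch.Size([4, 1024])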

Backpropagation: Gradient Flow

During training, gradients flow back through the embedding layer. The key insight: only the accessed row receives gradients.

Forward: embedding = E[i]
Backward: ∂L/∂E[i] = ∂L/∂embedding

All other rows: ∂L/∂E[j] = 0 (for j ≠ i)

This sparse gradient structure is why embedding layers are efficient to train. In a batch of 512 tokens from a 50K vocabulary, only ~512 rows receive gradient updates (assuming unique tokens), while 49,488 rows require no computation.

Advanced frameworks like PyTorch implement sparse gradient updates to further optimize this — gradients are stored as sparse tensors, updating only the non-zero entries rather than the full matrix.

Practical Impact

The mathematical equivalence means embeddings are theoretically a linear transformation, but the implementation efficiency makes modern NLP possible. Without direct indexing:

  • GPT-4's inference would slow from milliseconds to minutes
  • Training runs would require 50× more GPU memory
  • Real-time translation and autocomplete would be impossible

Key Takeaway

The Core Insight

AI doesn't memorize every word.
Instead, it discovers frequent patterns (like "play", "ing", "er") and combines them like building blocks.
With ~100K token pieces, it can understand billions of word combinations.

The Complete Pipeline
Text: "playing cats" (raw input text)
↓
Tokens: BPE segmentation → ["play", "ing", "cat", "s"] (reusable pieces)
↓
IDs: Vocabulary lookup → [42, 87, 512, 23] (unique identifiers)
↓
Embeddings: Matrix lookup → dense vectors (numbers that capture meaning)
↓
Vector Space: Semantic representation → similar meanings cluster together

This is how AI bridges the gap between human language and mathematical computation.

Tokenization impacts model cost, speed, and fairness across languages. The embedding matrix enables fast, constant-time lookups. Together, they convert billions of word combinations into a space where AI can find patterns and understand meaning.

Positional Encodings: Teaching Models About Order

Position Matters in Language

We've built an amazing system: text becomes tokens, tokens become embeddings, embeddings capture meaning. But there's a critical problem we haven't addressed yet.

Consider these two sentences:

Sentence 1:

"The dog chased the cat"

Clear: dog is doing the chasing

Sentence 2:

"The cat chased the dog"

Clear: cat is doing the chasing

Wait... Stop and Think

Both sentences use the exact same tokens: ["the", "dog", "chased", "the", "cat"]. Just in different order.

If we look up embeddings for these tokens, we get:
• "the" → [0.12, -0.45, ...]
• "dog" → [0.79, -0.28, ...]
• "chased" → [0.34, 0.67, ...]
• "cat" → [0.82, -0.31, ...]

But these embeddings are the same regardless of where the word appears!
The embedding for "dog" is identical whether it's at position 1 or position 4.

So how does the model know "dog" comes before "chased" vs after?

Understanding the Position-Blind Nature of Embeddings

When we convert tokens to embeddings, we lose information about word order. The model receives a collection of meaning vectors but has no indication which came first, second, or third.

Input: "The dog chased the cat"
Embeddings: [[0.12, -0.45, ...], [0.79, -0.28, ...], [0.34, 0.67, ...], [0.12, -0.45, ...], [0.82, -0.31, ...]]

Input: "The cat chased the dog"
Embeddings: [[0.12, -0.45, ...], [0.82, -0.31, ...], [0.34, 0.67, ...], [0.12, -0.45, ...], [0.79, -0.28, ...]]

Without position information, the model can't distinguish subject from object!

This poses a significant challenge for language understanding. Word order determines who did what to whom, whether something happened in past or future, and countless other critical distinctions.

How Positional Encodings Address This

We need to add position information to each token's embedding. The approach: add a position-specific vector to each embedding that encodes "this token is at position 0", "this is at position 1", etc.

Now the same word gets different representations based on where it appears:

"dog" at position 1 = embedding_dog + position_encoding_1
"dog" at position 4 = embedding_dog + position_encoding_4

Result: Different final vectors → Model knows the position!

Adding Position Information

Here's the brilliant idea: Create special "position vectors" and add them to the word embeddings!

Word Embedding
+
Position Vector
=
Final Input (has BOTH meaning AND position!)

Let's Build This Step by Step

Step 1: What We Already Have

Let's say we have the sentence: "The dog sleeps"

Position 0: "The" → Embedding: [0.1, 0.5, 0.2, 0.8, ...]
Position 1: "dog" → Embedding: [0.6, 0.3, 0.7, 0.4, ...]
Position 2: "sleeps" → Embedding: [0.4, 0.9, 0.1, 0.6, ...]

Notice: The embedding for "dog" is the same whether it's at position 1, position 5, or position 100!

Step 2: Create Position Vectors

For each position (0, 1, 2, 3, ...), we create a unique "position vector" - a special pattern of numbers:

Position 0 vector: [0.0, 1.0, 0.0, 1.0, ...]
Position 1 vector: [0.8, 0.5, 0.8, 0.6, ...]
Position 2 vector: [0.9, -0.4, 0.9, -0.3, ...]

Each position gets its own unique pattern - like a barcode for that position!

Step 3: Add Them Together!

Now we simply add the word embedding and position vector together:

For "dog" at position 1:
Word embedding: [0.6, 0.3, 0.7, 0.4, ...]
Position vector: [0.8, 0.5, 0.8, 0.6, ...]
Final input: [1.4, 0.8, 1.5, 1.0, ...]

✓ Success! This final vector now contains BOTH the meaning of "dog" AND the fact that it's at position 1!

The Key Insight

The same word at different positions gets different final vectors!
This is exactly what we need to solve the word order problem.

Sentence 1: "The dog bites the man"
Position 1: "dog" → Final input: [1.4, 0.8, 1.5, 1.0, ...]
Position 4: "man" → Final input: [0.5, 1.3, 0.2, 1.8, ...]
Sentence 2: "The man bites the dog"
Position 1: "man" → Final input: [0.9, 1.1, 0.7, 1.6, ...]
Position 4: "dog" → Final input: [1.0, 0.9, 1.2, 1.4, ...]

✓ Notice: "dog" at position 1 has different numbers than "dog" at position 4!
The AI can now tell the difference between "dog bites man" and "man bites dog"

But How Do We Create These Position Vectors?

We need a way to encode position information. There are different approaches used by different models:

❌ Simple counting (1, 2, 3...): Numbers grow unbounded, could mess up embeddings
✓ Sine and cosine patterns (Original Transformer): Add position vectors using sin/cos, stay between -1 and 1
✓ Learned vectors (BERT, GPT): Let the AI learn the best position vectors during training
✓ RoPE - Rotary Position Embedding (LLaMA, most modern LLMs): Instead of adding position info, rotate the embedding vectors by an angle determined by their position. More mathematically elegant and works better for long sequences!

Note: The "add position vector" approach we explained is conceptually simpler and used by the original Transformer. Modern models like LLaMA use rotation-based methods (RoPE) for better performance, but the core idea remains: inject position information so the model knows word order!
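
For reference, here is a simplified NumPy sketch of the original sine/cosine approach, including the final addition of word embeddings and position vectors (the sequence length, dimension, and embeddings are toy values):

import numpy as np

def sinusoidal_positions(seq_len, dim):
    """Original-Transformer-style position vectors (simplified sketch)."""
    positions = np.arange(seq_len)[:, None]              # 0, 1, 2, ...
    pair_dims = np.arange(0, dim, 2)[None, :]             # dimensions taken in pairs
    angles = positions / (10000 ** (pair_dims / dim))     # a different frequency per pair
    encoding = np.zeros((seq_len, dim))
    encoding[:, 0::2] = np.sin(angles)                    # even dimensions: sine
    encoding[:, 1::2] = np.cos(angles)                    # odd dimensions: cosine
    return encoding                                       # every value stays within [-1, 1]

pos = sinusoidal_positions(seq_len=3, dim=8)              # positions 0, 1, 2
word_embeddings = np.random.rand(3, 8)                    # "The", "dog", "sleeps" (toy vectors)
model_input = word_embeddings + pos                       # meaning + position, token by token
print(model_input.shape)                                  # (3, 8)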

Evolution of Positional Encoding Methods

Different approaches to encoding position information have emerged over time, each with tradeoffs:

  • 2017–2019: Sinusoidal, context up to ~2k tokens (mostly obsolete)
  • 2019–2021: Learned absolute, context up to ~4k tokens (limited use)
  • 2021–2023: RoPE / ALiBi, context up to 128k+ tokens (standard today)
  • 2024–2025: xPos, CARoPE, TAPE, context up to 1M+ tokens (research frontier)

Current State (2025): RoPE is the default positional strategy in most modern LLMs (LLaMA 2/3, Gemma, Mistral, Code-Llama), while ALiBi is used in MPT, Falcon, and JINA models. Both methods significantly outperform the original sinusoidal approach, especially for long context windows.

🔄 RoPE: Rotary Position Embedding (LLaMA, Mistral, Gemma)

What is RoPE?
RoPE (Rotary Position Embedding) is a method that encodes position information by rotating query and key vectors in the attention mechanism rather than adding position vectors to embeddings. It has become the most popular positional strategy for modern transformers.

How It Works

The Core Idea:

Instead of adding a position vector to the word embedding, RoPE rotates the embedding by an angle that depends on its position. Think of it like a clock hand rotating as time (position) advances.

Token at position 0: Rotate by 0°
Token at position 1: Rotate by θ°
Token at position 2: Rotate by 2θ°
Token at position n: Rotate by n×θ°
Rotation in 2D Planes:

RoPE organizes the embedding dimensions as pairs (treating each pair as a 2D coordinate). For a 768-dim embedding, that's 384 pairs. Each pair gets rotated by a position-dependent angle.

Example: Dimensions [0,1] form one 2D plane, [2,3] another, and so on. The rotation angle decreases for higher dimension pairs, creating a multi-frequency encoding.
The Math (Simplified):

For position pos and dimension pair i:

θᵢ = 10000^(-2i/d)
Rotation matrix rotates by pos × θᵢ
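
A simplified NumPy sketch of that rotation, applied to a single vector (real implementations vectorize this and apply it to the query and key vectors inside attention):

import numpy as np

def rope_rotate(vec, pos):
    """Rotate consecutive dimension pairs of `vec` by position-dependent angles."""
    d = len(vec)
    out = np.empty_like(vec, dtype=float)
    for i in range(d // 2):                        # one 2D plane per dimension pair
        theta = 10000 ** (-2 * i / d)              # lower frequencies for later pairs
        angle = pos * theta
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i]     = x * np.cos(angle) - y * np.sin(angle)
        out[2 * i + 1] = x * np.sin(angle) + y * np.cos(angle)
    return out

vec = np.random.rand(8)                # a toy 8-dimensional query/key vector
print(rope_rotate(vec, pos=0))         # position 0: rotation by 0, vector unchanged
print(rope_rotate(vec, pos=3))         # position 3: each pair rotated by 3 × θᵢ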

Why RoPE Works Better

  • Relative position encoding: The attention score between two tokens automatically depends on their relative distance, not absolute positions
  • No explicit position limit: Works for any sequence length without retraining (with proper scaling)
  • Better extrapolation: Maintains performance when given sequences longer than training length
  • Efficient: No extra parameters to learn - just rotation operations
  • Mathematically elegant: Decaying inter-token dependency with increasing distance emerges naturally

Context Extension Techniques (2024-2025)

Recent advances allow RoPE to handle even longer contexts:

NTK-Aware Scaling: Adjusts rotation frequencies to handle longer sequences (used in Code-LLaMA for 16K+ tokens)
YaRN (Yet another RoPE extensioN): Keeps angles stable for 256K+ tokens through dynamic interpolation
ComRoPE (CVPR 2025): Adds trainable parameters for improved scalability and robustness

Used in: LLaMA 2/3, Gemma, Mistral, Code-LLaMA, PaLM, GPT-NeoX, and most modern open-source LLMs

📐 ALiBi: Attention with Linear Biases (MPT, Falcon, JINA)

What is ALiBi?
ALiBi (Attention with Linear Biases) is a simpler approach that doesn't add positional embeddings at all. Instead, it directly modifies the attention scores by adding a penalty proportional to the distance between tokens.

How It Works

The Core Idea:

When computing attention between token at position i and token at position j, ALiBi adds a negative penalty based on their distance |i - j|:

attention_score = query · key - m × |i - j|

where m is a slope parameter (different for each attention head)

Concrete Example:

For a sentence "The cat sat on the mat":

Attention from "cat" (pos 1) to:
→ "The" (pos 0): penalty = -m × |1-0| = -m × 1
→ "cat" (pos 1): penalty = -m × |1-1| = 0
→ "sat" (pos 2): penalty = -m × |1-2| = -m × 1
→ "mat" (pos 5): penalty = -m × |1-5| = -m × 4

Result: Tokens far away get lower attention scores (larger penalty), while nearby tokens get higher attention.
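
A small NumPy sketch of the bias computation (the slope m and the attention scores are invented for illustration):

import numpy as np

seq_len, m = 6, 0.5                     # "The cat sat on the mat"; m is one head's slope
scores = np.random.rand(seq_len, seq_len)            # stand-in for raw query · key scores

positions = np.arange(seq_len)
distance = np.abs(positions[:, None] - positions[None, :])   # the |i - j| matrix
biased_scores = scores - m * distance                # distant tokens get penalized

print(distance[1])           # distances from "cat" (pos 1): [1 0 1 2 3 4]
print(biased_scores.shape)   # (6, 6)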

Why ALiBi Works Better

  • No position embeddings needed: Simpler architecture - word embeddings go directly into the model
  • Excellent extrapolation: Models trained on 1K tokens can handle 2K+ tokens at inference with no degradation
  • Training efficiency: 11% faster training and 11% less memory than sinusoidal methods
  • Length independence: Works for any sequence length without special techniques
  • Computational efficiency: Simple subtraction operation, hardware-friendly

ALiBi vs RoPE: Key Differences

ALiBi advantages: Simpler implementation, faster training, better computational efficiency
RoPE advantages: More widely adopted, mathematically elegant, proven at massive scale

Both methods achieve similar performance and far exceed older approaches like sinusoidal encoding.

Used in: MPT (MosaicML), Falcon (TII), JINA embeddings, and various research models

Why This Matters

In real-world applications, positional encodings enable understanding of:

Temporal Order
"First I called, then I emailed, finally I gave up"

The AI understands this is a sequence of events in time order.

Dependency Structure
"The product that I ordered last week arrived damaged"

"that I ordered last week" modifies "product" - position helps track these relationships.

Negation Scope
"Not satisfied with the resolution"

The word "Not" at the beginning flips the sentiment of the entire phrase.

Question vs Statement
"Can I get a refund?" vs "I can get a refund"

Word order distinguishes questions from statements - critical for routing!

Bottom Line: Positional encodings are the secret ingredient that allows transformers to understand that language is not just a bag of words - it's a sequence where order carries meaning.

Key Takeaway

Positional encodings capture word order information in transformer models.
Without position information, transformers cannot distinguish "dog bites man" from "man bites dog."
Most modern language models use RoPE (Rotary Position Embedding) for improved long-context understanding.

Training & Applications: From Learning to Deployment

How Do Embeddings Learn Meaning?

We've seen what embeddings are and how they're used. But where do these magical numbers come from? How does the model learn that "cat" and "dog" should have similar embeddings?

The Core Learning Principle

"You shall know a word by the company it keeps" - J.R. Firth (1957)

Words that appear in similar contexts tend to have similar meanings. This is called the distributional hypothesis, and it's the foundation of embedding learning.

Learning from Context

"The cat is sleeping on the couch"
"The dog is sleeping on the couch"
"The cat chased a mouse"
"The dog chased a ball"

"cat" and "dog" appear with similar surrounding words: "The ___ is sleeping", "The ___ chased". The model learns to give them similar embeddings because they appear in similar contexts.

The Training Mechanism: How Embeddings Actually Change

Understanding the principle is one thing — but how does the model actually adjust those 768 numbers? Here's the core mechanism.

Step 1: Embeddings Start Random

Before training, each word gets a random vector of numbers. "cat" = [0.42, -0.13, 0.88, ...], "dog" = [-0.07, 0.91, -0.24, ...]. These numbers mean nothing yet — they're just starting points.

Step 2: Neural Networks Transform Vectors

Neural networks are fundamentally vector transformation machines. They take input vectors, transform them through multiple layers, and produce output vectors.

Input embedding (vector)
  ↓ × weight matrix
Hidden layer 1 (new vector)
  ↓ × weight matrix
Hidden layer 2 (new vector)
  ↓ × weight matrix
Output prediction (vector)

Each layer multiplies the vector by a weight matrix, producing a new vector. The final vector represents the model's prediction. For next-token prediction, this output vector contains probabilities for each possible next word.

Step 3: The Learning Loop

1. Make a Prediction
Input: "The cat is sleeping on the ___"
Model predicts: "couch" (80% confident)
2. Measure the Error
Actual next word: "couch" ✓
Error = small (prediction was correct!)
3. Backpropagation
Calculate: "How should we change the weights (including embeddings) to reduce this error?"
The math works backward through all layers, computing gradients.
4. Update the Embeddings
Adjust the embedding values slightly:
"cat" embedding: [0.42, -0.13, 0.88, ...] → [0.43, -0.12, 0.89, ...]
Tiny changes, but repeated millions of times across billions of examples.

The Result

After training on billions of sentences, words that appear in similar contexts (like "cat" and "dog") naturally end up with similar embedding vectors — not because we told the model they're similar, but because the training process pushed them together to minimize prediction errors.
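
Here is a deliberately tiny PyTorch sketch of that loop: a four-word vocabulary, one training pair, and an embedding table updated by gradient descent (everything here is invented for illustration):

import torch
import torch.nn as nn

vocab = {"the": 0, "cat": 1, "dog": 2, "sleeps": 3}
embedding = nn.Embedding(len(vocab), 8)            # 8-dim embeddings, random at first
predictor = nn.Linear(8, len(vocab))               # maps a vector to next-word scores
optimizer = torch.optim.SGD(
    list(embedding.parameters()) + list(predictor.parameters()), lr=0.1
)

context = torch.tensor([vocab["cat"]])             # the input word
target = torch.tensor([vocab["sleeps"]])           # the word that actually came next

for step in range(100):
    logits = predictor(embedding(context))                 # 1. make a prediction
    loss = nn.functional.cross_entropy(logits, target)     # 2. measure the error
    optimizer.zero_grad()
    loss.backward()                                        # 3. backpropagation (gradients)
    optimizer.step()                                       # 4. update embeddings + weights

print(loss.item())   # the loss shrinks as the "cat" embedding adjusts, step by step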

Three Ways to Set Up the Learning Task

The mechanism above works for any prediction task. Different training approaches just change what the model is asked to predict:

1. Next Token Prediction: Given "The customer wants a ___", predict "refund". (Used by GPT, LLaMA, Claude)
2. Masked Language Modeling: Given "The customer wants a [MASK]", predict the masked word using context before and after. (Used by BERT, RoBERTa)
3. Contrastive Learning: Train so that similar pairs ("Product defective" ↔ "Item broken") end up close together while different pairs end up far apart. (Used by Sentence-BERT, text-embedding-3)

Want to understand training in depth?
Chapter 5: Fine-tuning & Model Adaptation covers:

  • How gradient descent optimization works
  • Learning rates, batch sizes, and convergence
  • Fine-tuning pretrained embeddings for specific domains
  • When to freeze vs. update embedding layers
  • Loss functions and evaluation metrics

Real-World Applications

See embeddings in action through a practical semantic search example, then explore how production AI systems like RAG and AI agents use this technology.

Example: Semantic Search in Action

Keyword Search Limitations
Customer asks: "My screen won't turn on after I press the power button"
Keyword search for "screen" + "turn on" + "power button":
• Returns 87 articles
• Includes irrelevant results like "screen reader" and "keyboard shortcuts"
• Right answer buried at position #47
Agent spends 3-5 minutes searching
How Semantic Search Captures Meaning
AI converts question to embedding, finds similar articles in under 1 second:
Top 5 results (all relevant!):
  1. "Display not responding to power" (0.94) ← Perfect match!
  2. "Monitor won't start up" (0.91)
  3. "Black screen after powering on" (0.88)
  4. "No display when pressing power button" (0.85)
  5. "Screen stays dark on boot" (0.82)
Different words, same meaning — embeddings capture semantic similarity!
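
A minimal sketch of this search pattern in Python, assuming the open-source sentence-transformers package and its all-MiniLM-L6-v2 model (neither is required by this chapter; any embedding provider works the same way):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")     # a small open-source embedding model

articles = [
    "Display not responding to power",
    "Monitor won't start up",
    "Black screen after powering on",
    "How to use keyboard shortcuts",
]
article_vecs = model.encode(articles)               # computed once and stored

query = "My screen won't turn on after I press the power button"
query_vec = model.encode(query)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank articles by cosine similarity to the query, most similar first.
scores = sorted(((cosine(query_vec, v), title)
                 for v, title in zip(article_vecs, articles)), reverse=True)
for score, title in scores:
    print(round(score, 2), title)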

How Production AI Systems Use Embeddings

RAG: Retrieval-Augmented Generation
LLMs don't know your company's private data. RAG uses embeddings to find relevant documents, then sends them to the LLM as context.
The basic flow:
1. Convert all docs → embeddings → store in vector database
2. User asks question → convert to embedding
3. Find top 5 most similar docs (cosine similarity)
4. Send question + retrieved docs → LLM generates grounded answer
→ Chapter 7: RAG (Retrieval-Augmented Generation)
Covers chunking strategies, vector database selection, reranking, hybrid search, handling citations, evaluating retrieval quality, and production deployment patterns
AI Agents: Tool Selection
Modern AI agents have 100s of tools/APIs. How do they pick the right one? Embeddings.
The basic flow:
1. Each tool description → embedding
2. User request → embedding
3. Find top 5 most relevant tools (cosine similarity)
4. Agent picks best tool from those 5 (not all 500!)
→ Chapter 8: AI Agents & Tool Use
Covers agent architectures (ReAct, Plan-and-Execute), tool calling protocols, multi-step reasoning, memory management, error recovery, and building production agents
Other Common Applications
Recommendations: "Customers who liked X also liked Y" — embeddings cluster similar products
Duplicate Detection: Find tickets/documents with same meaning but different words
Classification: Route items to right team by embedding similarity
Clustering: Auto-discover topics in customer feedback without manual tagging
The Core Pattern:
Convert items (docs, tools, products) → embeddings → find similar items using cosine similarity
Every modern AI deployment — RAG, agents, recommendations, search — uses this one pattern.

What's Next

What's Being Solved Right Now

You've seen how embeddings work in production—semantic search, RAG, AI agents. Now see what's improving behind the scenes: the same math you learned (tokenization, vector spaces, cosine similarity), just scaled to solve global business problems.

Understanding these challenges helps you recognize when existing solutions might struggle—and what emerging technologies might solve next.

Cost Inequality Across Languages

Impact: API costs, regional fairness

BPE and WordPiece tokenizers split Hindi/Bengali text into 2-5× more tokens than English for the same content. This means Indian companies pay up to 5× more for API calls—same question, higher cost. A customer support system serving 10M users in India pays 3-5× more for identical semantic retrieval compared to English markets.

What's being built: Researchers are developing Universal Tokenizers with balanced vocabularies (250K+ tokens) that aim to reduce this cost inequality across languages.

Retrieval Accuracy at Billion-Document Scale

Impact: Enterprise search quality, knowledge base size limits

Google DeepMind discovered that single-vector embeddings start losing accuracy after a certain document count: 512-dim fails around 500K documents, 1024-dim around 4M, 4096-dim around 250M. This isn't a bug—it's a mathematical ceiling.

What's being built: Multi-vector retrieval systems like ColBERT use multiple 128-dim vectors per document instead of one 1024-dim vector, preserving finer semantic detail at scale.

Preserving Context Across Chunk Boundaries

Impact: Long-document retrieval accuracy, legal/financial document search

Traditional APIs return one static vector per chunk. When long documents split into 512-token chunks, critical context spanning boundaries gets lost. Example: "The company announced layoffs in Q3, but profits rose in Q4" might split across chunks, losing the causal relationship between events.

What's being built: Late Chunking techniques embed all tokens first (up to 32K), then chunk afterward—preserving cross-boundary context. Voyage-context-3 achieves 14% better retrieval accuracy using this approach, particularly beneficial for long legal and financial documents.

Multimodal Search & Edge Deployment

Impact: Visual search, IoT/edge AI deployments

Modern AI needs to search across images, videos, audio, and text jointly. How do you create a vector space where "a photo of a cat" and the text "cat" cluster together? Models like jina-clip-v2 train text and image encoders together, creating a shared 512-dim space—same cosine similarity math, now spanning modalities.

What's being built: For edge devices (Raspberry Pi: 8GB RAM), Matryoshka Embeddings generate multiple resolutions (1024 → 128 dims) from one model—use lower dimensions on IoT, full precision in cloud. This matters for on-device search without cloud latency.

COMING NEXT

Chapter 2: Non-Linearity

Embeddings convert text to vectors and measure similarity—but that's still just linear math. To recognize patterns, make predictions, and truly "learn," AI needs something more.

Learn why stacking layers matters and how activation functions transform simple calculations into intelligent systems.

Test Your Embedding Knowledge

Ready to test your understanding? Answer all questions correctly to unlock your achievement!

1. What is an embedding in the context of neural networks?

2. What does cosine similarity measure?

3. Why is BPE (Byte-Pair Encoding) preferred over word-level tokenization?

4. How does one-hot encoding × embedding matrix work mathematically?

5. Why do we need positional encodings in transformers?

6. What is the key advantage of learned embeddings over one-hot encoding?

7. How do embeddings learn that "cat" and "dog" should be similar?