How AI captures the meaning of language
We've learned how neural networks process data using matrices and matrix multiplication. But there's a fundamental mismatch: neural networks only understand numbers, while human language is made of words, sentences, and meaning.
On one side we have words, which carry meaning, context, and relationships. On the other side we have numbers, which can be multiplied and transformed. How do we translate between the two?
How do we convert the word "cat" into a list of numbers that captures its meaning in a way that neural networks can understand? This is what embeddings solve.
We're going deep into embeddings - understanding not just what they are, but how they work, why they work, and how modern AI systems use them. This chapter covers:
- How can geometry represent meaning?
- The mathematics of semantic closeness
- Breaking text into processable pieces
- Where the magic numbers live
- Teaching models about word order
- How embeddings learn and what they enable
Picture a map where every word has a specific location, and similar words sit close together while unrelated words sit far apart.
This "meaning map" is exactly what a vector space is. Each word becomes a point in a high-dimensional space where geometry - distances and angles - encodes semantic relationships.
Our map visualization is 2-dimensional (left-right, up-down). But to capture the full richness of language, we need many more dimensions.
- A 2-dimensional map can capture some relationships, but is very limited
- A few dozen dimensions are better, but still insufficient for language
- Hundreds of dimensions (768 is a common size) are what modern models actually use
More dimensions = more capacity to capture subtle semantic nuances
Imagine you have thousands of songs, and you want to organize them so that similar songs are easy to find. You need to convert each song into numbers that capture its essence. What kind of numbers should you use?
You decide to give each song a unique ID: Song #1, Song #2, Song #3, and so on. Let's see how this works for three songs: a rock ballad, a soft rock track, and an EDM track.
The Problem: When you search for songs similar to "Rock Ballad" (#4,532), the computer has no way to know that "Soft Rock" (#12,445) is related. The ID numbers are just arbitrary labels - they don't capture the musical similarity.
Instead of arbitrary IDs, what if each song got a list of numbers representing its characteristics? Slow songs get similar "tempo" numbers, guitar-heavy songs get similar "instrument" numbers, and so on.
Why This Works: The numbers directly represent musical properties. Both rock ballads have similar tempo values (0.234 and 0.241) — they're practically identical! The EDM track (-0.678) is clearly different. Now when you search for similar songs, the computer can use these numbers to find matches. This is exactly how embeddings work for words!
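Here's a minimal sketch of the idea in Python. The tempo values (0.234, 0.241, -0.678) come from the example above; the other feature dimensions and their exact numbers are made up for illustration:

```python
import numpy as np

# Feature vectors: [tempo, guitar_presence, energy].
# Tempo values match the example above; the other two dimensions are illustrative assumptions.
songs = {
    "Rock Ballad": np.array([0.234, 0.81, -0.12]),
    "Soft Rock":   np.array([0.241, 0.78, -0.05]),
    "EDM Track":   np.array([-0.678, -0.60, 0.92]),
}

query = songs["Rock Ballad"]
for name, vec in songs.items():
    distance = np.linalg.norm(query - vec)  # straight-line distance between feature vectors
    print(f"{name:12s} distance from Rock Ballad: {distance:.3f}")

# Soft Rock ends up much closer to Rock Ballad than the EDM track does.
# The arbitrary IDs (#4,532 vs. #12,445) could never tell us that.
```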
You might wonder: why do we need negative numbers? Why not just use 0 to 1? The answer: direction matters for capturing opposite meanings.
Real example: Across 768 dimensions, each dimension might capture a different semantic aspect: temperature, emotion, time, size, formality, concreteness, etc. Using negative and positive values lets the model represent rich, nuanced relationships.
- Values like 0.234, 0.267, 0.241 can be very close, capturing subtle similarities
- -0.8 vs +0.8 captures opposite meanings along each dimension
- Each dimension captures a different semantic aspect, combining to represent complex meaning
Now that we understand continuous vectors, let's look at real embeddings. An embedding is simply a list of continuous numbers - a vector - typically hundreds of values long (768 is a common size).
"cat" and "dog" have similar numbers because these words appear in similar contexts during training. "democracy" has very different numbers because it appears in completely different contexts. This isn't programmed - it emerges from learning!
We said "cat" and "dog" are "close" in vector space. But what does that mean mathematically? We need a precise way to measure similarity between vectors.
Imagine two people standing in a room. How do we measure if their positions are "similar"?
Two natural options come to mind: the straight-line (Euclidean) distance between their positions, or the angle between the directions you would point to reach each of them. Both work, but for embeddings we typically care about direction more than absolute distance.
The most common way to measure similarity is cosine similarity. It measures the angle between two vectors: cos(θ) = (A · B) / (|A| |B|).
- Small angle = similar words
- Large angle = different words
- Result range: -1 to 1
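A quick sketch of cosine similarity in Python. The three 4-dimensional vectors are toy values chosen for illustration, not outputs of a real model:

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| * |b|); ranges from -1 to 1."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 4-dimensional "embeddings" - real models use hundreds of dimensions.
cat       = np.array([0.60, 0.20, -0.10, 0.50])
dog       = np.array([0.55, 0.25, -0.05, 0.45])
democracy = np.array([-0.30, 0.80, 0.40, -0.60])

print(cosine_similarity(cat, dog))        # close to 1: similar direction, similar meaning
print(cosine_similarity(cat, democracy))  # much lower: very different meaning
```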
Imagine an AI system that processes thousands of messages daily. It needs to understand that "refund", "re-fund", and "refunding" are related. But how does AI break down text into processable pieces?
The simplest approach is to split on spaces so that each word becomes a token. But is splitting by spaces always the answer? Modern AI uses more sophisticated approaches.
Byte-Pair Encoding (BPE) is the most popular tokenization method in 2025. It learns an optimal vocabulary by iteratively merging the most frequent character pairs. Let's see the exact algorithm:
1. Start with all unique characters in your training data. For the corpus "refund refund refunding fund funding", that's {r, e, f, u, n, d, i, g}.
2. Count every adjacent pair of symbols and find the most frequent pair.
3. Merge that pair wherever it occurs and add the merged pair to the vocabulary.
4. Continue merging until the vocabulary reaches the desired size (typically 32,000-50,000 tokens).
At each iteration, BPE selects the pair (x, y) that maximizes frequency: (x, y) = argmax over all adjacent symbol pairs of count(x, y).
This greedy approach ensures the most common patterns are captured first, leading to efficient tokenization of frequent words and morphemes.
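Here is a minimal sketch of the BPE training loop in Python, run on the toy corpus above. The choice of 5 merges and the tie-breaking behavior are illustrative, not taken from any particular tokenizer:

```python
from collections import Counter

corpus = "refund refund refunding fund funding"

# Represent each word as a tuple of symbols (initially single characters), with its frequency.
words = Counter(tuple(word) for word in corpus.split())

def pair_counts(words):
    """Count how often each adjacent symbol pair occurs across the corpus."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

vocab = set(ch for word in words for ch in word)
for _ in range(5):  # 5 merges here; real tokenizers keep going until vocab size ~32k-50k
    counts = pair_counts(words)
    best = max(counts, key=counts.get)   # greedy: the most frequent pair wins
    vocab.add(best[0] + best[1])         # add the merged pair to the vocabulary
    words = merge(words, best)
    print("merged", best, "->", best[0] + best[1])

print(sorted(vocab))
```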
AI systems often need to understand messages about products like "SuperWidget-3000" or technical terms like "API integration". Subword tokenization excels here:
With a whole-word vocabulary, "SuperWidget-3000" becomes an unknown token - meaning lost!
With subword tokenization, it breaks into familiar pieces (something like "Super", "Widget", "3000") - meaning preserved through components!
We've tokenized our text into discrete units. Now we need to convert each token into a vector of numbers. This is where the embedding matrix comes in - it's essentially a giant lookup table that maps token IDs to vectors.
Once the AI has processed "refund" into token ID 5432, it needs the vector representation. Think of the embedding matrix like a phonebook: you look up a name (token ID) and get back contact information (the embedding vector).
Given a token ID, we simply retrieve the corresponding row from the embedding matrix:
Example: E[5432] returns row 5432 of the matrix, which is the embedding for "refund".
Before we dive deeper into embeddings, we need to understand one-hot encoding - the mathematical representation of categorical data that makes the lookup operation work.
One-hot encoding represents a token as a vector with all zeros except for a single 1 at the position corresponding to that token's ID.
Here's the key insight: Looking up an embedding is mathematically equivalent to multiplying the one-hot vector by the embedding matrix!
One-hot vector: shape (1 × vocab_size) = (1 × 5)
Embedding matrix E: shape (vocab_size × embedding_dim) = (5 × 3)
Their product: shape (1 × embedding_dim) = (1 × 3)
The one-hot vector with a 1 at position 2 selects row 2 of E!
In practice: We skip creating the one-hot vector and directly index into the embedding matrix (it's more efficient). But mathematically, they're equivalent!
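A small sketch of that equivalence in Python. The tiny 5-token vocabulary and the 3-dimensional embedding values are made-up numbers for illustration:

```python
import numpy as np

# Toy embedding matrix: 5 tokens in the vocabulary, 3 dimensions each.
E = np.array([
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
    [1.3, 1.4, 1.5],
])

token_id = 2

# Method 1: direct lookup - just index into the matrix.
lookup = E[token_id]

# Method 2: multiply a one-hot vector by the embedding matrix.
one_hot = np.zeros(E.shape[0])
one_hot[token_id] = 1.0
product = one_hot @ E

print(lookup)                        # [0.7 0.8 0.9]
print(product)                       # [0.7 0.8 0.9]
print(np.allclose(lookup, product))  # True - the two methods are mathematically equivalent
```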
| Feature | One-Hot Encoding | Learned Embeddings |
|---|---|---|
| Dimensionality | vocab_size (32,000+) | Compact (e.g., 768) |
| Semantic Similarity | None - all words equally distant | Captures meaning - similar words are close |
| Sparsity | 99.997% zeros (sparse) | Dense - every dimension used |
| Parameters | 0 (fixed representation) | Learned from data (trainable) |
| Generalization | Poor - can't handle unseen words | Good - subword tokens help |
AI systems use embeddings to recognize when differently worded messages express similar sentiment. But similarity of meaning is only part of the story.
Consider these two messages: "The customer canceled the refund" and "The refund canceled the customer". They contain exactly the same tokens - only the order differs.
When we convert tokens to embeddings, we lose all information about where the token appears in the sequence. The embedding for "customer" is identical whether it's the first word or the last word.
Embeddings have no position information built in!
Positional encodings solve this by adding position information to our embeddings. The idea is brilliantly simple: create special vectors that encode position, and add them to the word embeddings.
By adding positional information, each token now has both meaning and position!
The original Transformer paper (2017) introduced sinusoidal positional encodings - a clever mathematical approach that doesn't require training. Here are the formulas:

PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

where pos is the token's position, i indexes each sine/cosine pair, and d is the embedding dimension.
Let's compute the positional encoding for position 0 with d = 768. Every sine term is sin(0) = 0 and every cosine term is cos(0) = 1, so:
Result for position 0: PE(0) = [0.0, 1.0, 0.0, 1.0, …]
Now let's compute for position 1. The first pair uses sin(1) ≈ 0.841 and cos(1) ≈ 0.540; later pairs use progressively lower frequencies:
Result for position 1: PE(1) = [0.841, 0.540, 0.828, 0.561, …]
Notice: Different from position 0, giving each position a unique signature!
- Different positions produce different patterns of sine and cosine values
- Adjacent positions have similar encodings (position 5 and position 6 are close)
- Sine and cosine always output values between -1 and 1, a bounded range that mixes well with embedding values
- They can handle sequences longer than what was seen during training
- PE(pos+k) can be expressed as a linear function of PE(pos), helping the model learn relative positions
Here's what the first 4 dimensions look like across the first 10 positions (the sketch below computes them):
Each position gets a unique pattern - like a barcode for position!
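A short sketch that computes these values with NumPy, using d = 768 as in the worked example above and printing only the first 4 dimensions for the first 10 positions:

```python
import numpy as np

def sinusoidal_positional_encoding(num_positions, d):
    """PE[pos, 2i] = sin(pos / 10000**(2i/d)), PE[pos, 2i+1] = cos(pos / 10000**(2i/d))."""
    pe = np.zeros((num_positions, d))
    positions = np.arange(num_positions)[:, np.newaxis]   # column of positions 0..num_positions-1
    div = 10000 ** (np.arange(0, d, 2) / d)               # one frequency per sin/cos pair
    pe[:, 0::2] = np.sin(positions / div)
    pe[:, 1::2] = np.cos(positions / div)
    return pe

pe = sinusoidal_positional_encoding(num_positions=10, d=768)
print(pe[0, :4])   # [0.    1.    0.    1.   ]  - matches PE(0) above
print(pe[1, :4])   # [0.841 0.540 0.828 0.561] - matches PE(1) above
print(pe[:, :4])   # each of the 10 positions gets its own unique pattern
```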
While sinusoidal encodings are still widely used, modern research has introduced alternatives that can perform better for specific tasks:
- Learned positional embeddings. Approach: treat position encodings as trainable parameters, just like word embeddings.
- Rotary Position Embedding (RoPE). Approach: encode position by rotating the embedding vectors in a high-dimensional space. 2025 status: the dominant choice for large language models.
- ALiBi (Attention with Linear Biases). Approach: instead of adding anything to the embeddings, directly bias the attention scores based on distance.
- Relative positional encodings. Approach: encode the relative distance between tokens rather than absolute positions.
In real-world applications, positional encodings enable understanding of:
- Temporal sequences: the AI understands that a chain of events is described in time order.
- Long-range dependencies: "that I ordered last week" modifies "product" - position helps track these relationships.
- Negation: the word "Not" at the beginning flips the sentiment of the entire phrase.
- Questions vs. statements: word order distinguishes questions from statements - critical for routing!
Bottom Line: Positional encodings are the secret ingredient that allows transformers to understand that language is not just a bag of words - it's a sequence where order carries meaning.
We've seen what embeddings are and how they're used. But where do these magical numbers come from? How does the model learn that "cat" and "dog" should have similar embeddings?
Words that appear in similar contexts tend to have similar meanings. This is called the distributional hypothesis, and it's the foundation of embedding learning.
"cat" and "dog" appear with similar surrounding words: "The ___ is sleeping", "The ___ chased". The model learns to give them similar embeddings because they appear in similar contexts.
Embeddings are learned as part of larger model training. Here are the main approaches used in 2025:
Next-token prediction (causal language modeling): train the model to predict the next word in a sequence. This is the dominant approach for modern LLMs.
The embedding layer is the first layer of the model. As the model learns to predict words, the embeddings automatically learn to capture meaning!
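Here is a minimal PyTorch sketch of that idea: an embedding layer as the first layer of a next-token predictor. The vocabulary size, dimensions, and random "training data" are placeholder assumptions - a real LLM is vastly larger - but the embedding table is trained the same way, by backpropagating the prediction loss:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 1000, 64   # toy sizes; real models use 32k+ tokens and 768+ dimensions

model = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),   # the embedding matrix - learned like any other weight
    nn.Linear(embed_dim, vocab_size),      # predict a score for every possible next token
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Fake "text": random token IDs standing in for a tokenized corpus.
tokens = torch.randint(0, vocab_size, (512,))
inputs, targets = tokens[:-1], tokens[1:]   # predict token t+1 from token t

for step in range(100):
    logits = model(inputs)                  # shape (511, vocab_size)
    loss = loss_fn(logits, targets)
    optimizer.zero_grad()
    loss.backward()                         # gradients flow into the embedding rows that were used
    optimizer.step()

# After training on real text, rows of model[0].weight for words that appear
# in similar contexts end up close together - that's the learned embedding space.
```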
Masked language modeling: randomly mask out words in a sentence and train the model to predict them from context.
Unlike GPT which only sees past context, BERT sees both past and future context (bidirectional).
Contrastive learning: train embeddings to be similar for related items and different for unrelated items.
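A tiny sketch of the contrastive idea in PyTorch. The "related" and "unrelated" sentences, the random vectors standing in for model outputs, and the margin of 0.2 are all illustrative assumptions:

```python
import torch
import torch.nn.functional as F

# Pretend these are embeddings a model produced for three sentences.
anchor   = torch.randn(768, requires_grad=True)   # "I want a refund"
positive = torch.randn(768, requires_grad=True)   # "Please return my money"       (related)
negative = torch.randn(768, requires_grad=True)   # "What are your opening hours?" (unrelated)

sim_pos = F.cosine_similarity(anchor, positive, dim=0)
sim_neg = F.cosine_similarity(anchor, negative, dim=0)

# Push related pairs together and unrelated pairs apart (margin of 0.2 is arbitrary).
loss = torch.clamp(0.2 - sim_pos + sim_neg, min=0)
loss.backward()   # gradients nudge the embeddings in the right directions
```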
The embedding landscape has evolved dramatically. The leading models as of 2025 target a range of niches:
- General text similarity and semantic search
- Multilingual applications and classification
- Domain-specific retrieval and long documents
- Self-hosted, privacy-sensitive applications
- High-quality open-source options
MTEB (Massive Text Embedding Benchmark): The standard benchmark for evaluating embedding models across 56 datasets covering 8 tasks (classification, clustering, retrieval, etc.). Scores are out of 100. As of October 2025, top models score in the mid-60s.
Embeddings power countless AI applications. Here are the major use cases in 2025:
Convert documents and queries to embeddings, then find relevant documents using cosine similarity. This enables AI systems to retrieve information based on meaning rather than exact keyword matches (a code sketch follows after this list of use cases).
Embed items and users, recommend items whose embeddings are close to the user's preferences.
Use embeddings as input features to classifiers for tasks like spam detection, sentiment analysis, topic classification.
Group similar documents by clustering their embeddings to discover themes and patterns.
Find near-duplicate content by computing embedding similarity.
Multilingual embeddings map words from different languages to the same vector space.
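To make the semantic search use case concrete, here's a minimal sketch using the sentence-transformers library. The model name "all-MiniLM-L6-v2" and the example documents are illustrative choices; any embedding model with a similar API would work:

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative model and documents - swap in whatever fits your stack.
model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "How do I request a refund for my order?",
    "Our office hours are 9am to 5pm on weekdays.",
    "Shipping usually takes 3-5 business days.",
]
doc_embeddings = model.encode(documents, convert_to_tensor=True)

query = "I want my money back"
query_embedding = model.encode(query, convert_to_tensor=True)

# Cosine similarity between the query and every document; pick the best match.
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
best = scores.argmax().item()
print(documents[best])   # the refund document, despite sharing almost no keywords with the query
```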
Let's bring it all together. Here's how embeddings work in a modern support system:
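The worked example below is a compact sketch of the whole pipeline from this chapter. The tiny vocabulary, the whitespace tokenizer, the random embedding matrix, and the 8-dimensional size are toy assumptions; a real support system would use a trained tokenizer and model, but the steps are the same:

```python
import numpy as np

# Step 1: Text -> tokens (toy whitespace tokenizer with a tiny fixed vocabulary).
vocab = {"i": 0, "want": 1, "a": 2, "refund": 3, "please": 4}
text = "i want a refund please"
token_ids = [vocab[word] for word in text.split()]

# Step 2: Tokens -> embeddings (look up rows in a random embedding matrix).
embed_dim = 8
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), embed_dim))
embeddings = E[token_ids]                      # shape (5, 8)

# Step 3: Add sinusoidal positional encodings so word order is preserved.
positions = np.arange(len(token_ids))[:, None]
div = 10000 ** (np.arange(0, embed_dim, 2) / embed_dim)
pe = np.zeros_like(embeddings)
pe[:, 0::2] = np.sin(positions / div)
pe[:, 1::2] = np.cos(positions / div)
model_input = embeddings + pe                  # what the neural network actually sees

print(model_input.shape)   # (5, 8): five tokens, each an 8-dimensional vector with position baked in
```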
You've now learned the complete story of how words become numbers that AI can understand:
The Magic: Embeddings transform the discrete, symbolic world of language into continuous geometric space where meaning becomes distance, similarity becomes proximity, and relationships become directions. This is how modern AI bridges the gap between human communication and machine computation.
Core concepts from this chapter:
Neural networks only understand numbers. Embeddings convert words, tokens, and concepts into dense vectors (lists of numbers) that capture semantic meaning in a way machines can process.
Before embedding, text must be split into tokens. Byte-Pair Encoding (BPE) finds common patterns and creates a vocabulary that balances flexibility (handling new words) with efficiency (common words get single tokens).
Embeddings are learned from massive text data. Words that appear in similar contexts (like "cat" and "dog") end up with vectors that are close together in the embedding space, capturing semantic relationships.
Embeddings alone don't capture word order. "The customer canceled the refund" and "The refund canceled the customer" would look identical. Positional encodings add position information so models understand sequence order.
In embedding space, similarity becomes distance. Close vectors = similar meaning. This geometric structure enables powerful operations: vector arithmetic, similarity search, and clustering related concepts.
Every modern language model uses embeddings: search engines, chatbots, translation systems, recommendation engines. They're the input layer that makes language processing with neural networks possible.
Key Transformation:
Text → Tokens → Embeddings → Embeddings + Positional Encodings → Neural Network Input
This pipeline transforms discrete symbols (words) into continuous geometric space where meaning becomes computable.