From predicting numbers to predicting yes/no answers
In Chapters 1 and 2, we learned to predict continuous numbers: house prices like $400k, $250k, $500k. We used MSE (Mean Squared Error) as our loss function because we were measuring how far off our predictions were from actual numbers.
But what if instead we wanted to predict yes/no answers? Categories instead of numbers?
| | Predict a NUMBER | Predict a CATEGORY |
|---|---|---|
| Question | How much? | Which category? |
| Answer | A continuous number | A label (Yes/No, A/B/C) |
| Example | House price = $450k | Email = Spam or Not Spam |
| Loss Function | MSE (Mean Squared Error) | Cross-Entropy |
Imagine we run a subscription service. We have data about our customers: how many months they've been with us and how often they use our service. We want to predict: Will they RENEW or CHURN (cancel)?
Each point represents a customer. Green = Renewed, Red = Churned
| Customer | Months Subscribed (x1) | Usage hrs/week (x2) | Outcome (y) |
|---|---|---|---|
| A | 10 | 35 | RENEW |
| B | 12 | 38 | RENEW |
| C | 14 | 32 | RENEW |
| D | 11 | 28 | RENEW |
| E | 2 | 8 | CHURN |
| F | 1 | 5 | CHURN |
| G | 3 | 12 | CHURN |
| H | 2.5 | 6 | CHURN |
Notice how customers who renew cluster in one region (longer subscription, higher usage) while those who churn cluster in another (shorter subscription, lower usage).
Just like in regression, the machine uses weights and bias. But instead of predicting a number, it's trying to draw a line (decision boundary) that separates Renew from Churn customers.
Many misclassifications! The line is in the wrong place, and several customers end up on the wrong side.
Count how many customers are on the wrong side of the line. Each mistake increases the error.
The machine tweaks the weights and bias to rotate and shift the line, trying to reduce misclassifications. Here's how:
The gradient tells us which direction to move each weight to reduce errors. Think of it as a compass pointing toward "less wrong" predictions.
The learning rate controls how big of a step we take in that direction.
Example: Learning rate = 0.01 means we move 1% of the gradient's recommendation each step
new_weight = old_weight - (learning_rate × gradient)

We subtract because we want to go downhill (reduce error), and the gradient points in the direction of steepest increase.
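Here's a minimal sketch of that update in code. The weight and gradient values are illustrative numbers only, not ones computed from our customer data:

```python
learning_rate = 0.01  # how big a step we take

# Illustrative values only -- in real training the gradient comes from the loss
old_weight = 0.50
gradient = 2.0        # positive gradient: increasing this weight would increase the error

# Step *against* the gradient to reduce the error
new_weight = old_weight - learning_rate * gradient
print(new_weight)     # 0.48 -- the weight moved slightly "downhill"
```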
Line separates the clusters well! All 8 customers correctly classified.
If Decision > 0 → Predict RENEW
If Decision ≤ 0 → Predict CHURN
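Putting the decision rule into code looks like this. The weights and bias below are hypothetical values chosen for illustration, not parameters actually learned from our data:

```python
def predict(months_subscribed, usage_hours, w1, w2, bias):
    # Decision score: which side of the line is this customer on?
    decision = w1 * months_subscribed + w2 * usage_hours + bias
    return "RENEW" if decision > 0 else "CHURN"

# Hypothetical learned parameters
w1, w2, bias = 1.0, 1.0, -25.0

print(predict(12, 38, w1, w2, bias))  # RENEW (decision = 25 > 0)
print(predict(2, 8, w1, w2, bias))    # CHURN (decision = -15 <= 0)
```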
Watch the decision boundary adjust itself to correctly classify all customers
The interesting thing is that multiple different lines can separate the data reasonably well. Depending on where gradient descent starts and how it progresses, the machine might find different solutions!
Line A: Steep slope
Accuracy: 100% (8/8 correct)
Line B: Medium slope
Accuracy: 100% (8/8 correct)
Line C: Gentle slope
Accuracy: 100% (8/8 correct)
All three lines correctly classify our 8 training customers - they all achieve 100% accuracy on the training data! But each line makes slightly different predictions for new customers not in our dataset. A customer near the boundary might be classified as RENEW by one model but CHURN by another. This leads to a key insight:
Classification is about finding a decision boundary that separates clusters. The machine learns by adjusting weights and bias through gradient descent, trying many iterations until it finds a line that minimizes errors. But there's no single "perfect" answer—just different trade-offs between different types of mistakes.
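To make this concrete, here's a small sketch that checks three hand-picked boundaries against our 8 customers. The weights are hypothetical (chosen by hand, not found by gradient descent), but all three reach 8/8:

```python
# (months, usage, actual outcome) for customers A-H
customers = [
    (10, 35, "RENEW"), (12, 38, "RENEW"), (14, 32, "RENEW"), (11, 28, "RENEW"),
    (2, 8, "CHURN"), (1, 5, "CHURN"), (3, 12, "CHURN"), (2.5, 6, "CHURN"),
]

# Three different (w1, w2, bias) boundaries -- steep, medium, gentle
boundaries = {
    "Line A (steep)":  (5.0, 0.1, -30.0),
    "Line B (medium)": (1.0, 1.0, -25.0),
    "Line C (gentle)": (0.1, 1.0, -20.0),
}

for name, (w1, w2, bias) in boundaries.items():
    correct = sum(
        ("RENEW" if w1 * x1 + w2 * x2 + bias > 0 else "CHURN") == y
        for x1, x2, y in customers
    )
    print(f"{name}: {correct}/8 correct")  # each prints 8/8
```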
The decision boundary we've been calling a "line" has a formal mathematical name: a hyperplane. This term might sound intimidating, but it's actually quite simple once you see the pattern.
With 2 features (months, usage), the hyperplane is a line. This is what we've been working with!
With 3 features (add "support tickets"), the hyperplane becomes a flat plane cutting through 3D space.
With 768 features (word embeddings), the hyperplane is a 767-dimensional surface. Can't visualize it, but the math works identically!
Notice that all three decision boundaries follow the same formula we learned earlier:
w₁×x₁ + w₂×x₂ + w₃×x₃ + ... + wₙ×xₙ + bias = 0
In 2D: This equation defines a line
In 3D: This equation defines a plane
In nD: This equation defines a hyperplane
Same formula. Same gradient descent. Same learning process. Just more dimensions!
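In code, this formula is usually written as a dot product, so nothing changes as the number of features grows. A minimal NumPy sketch with made-up numbers:

```python
import numpy as np

def decision_score(weights, features, bias):
    # w1*x1 + w2*x2 + ... + wn*xn + bias, for any number of features
    return np.dot(weights, features) + bias

# 2 features: months, usage -> the line we've been drawing
print(decision_score(np.array([1.0, 1.0]), np.array([12.0, 38.0]), -25.0))

# 768 features: exactly the same call, just longer vectors
w = np.random.randn(768)
x = np.random.randn(768)
print(decision_score(w, x, -0.5))
```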
This section covers an elegant mathematical technique used in ML libraries. It's not essential for understanding classification, but it's interesting if you're curious about implementation details!
You might have noticed something slightly awkward about our formula:
w₁×x₁ + w₂×x₂ + w₃×x₃ + ... + wₙ×xₙ + bias = 0

The bias term looks a bit tacked on—all the other terms multiply a weight by a feature, but the bias just... sits there at the end. There's an elegant way to fix this!
Instead of keeping bias separate, we can absorb it into the weight vector by introducing a "dummy" feature x₀ that is always equal to 1.
Let: x₀ = 1 (always)
Let: w₀ = bias
Now our equation becomes perfectly uniform:
w₀×x₀ + w₁×x₁ + w₂×x₂ + w₃×x₃ + ... + wₙ×xₙ = 0

Since x₀ = 1, the term w₀×x₀ = w₀×1 = bias, so we haven't changed the math—we've just repackaged it more elegantly.
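A quick sketch showing that the repackaging gives exactly the same number (the weights here are illustrative):

```python
import numpy as np

w = np.array([1.0, 1.0])      # w1, w2
x = np.array([12.0, 38.0])    # months, usage
bias = -25.0

# Original form: weights dot features, plus a separate bias
original = np.dot(w, x) + bias

# Augmented form: prepend x0 = 1 and fold the bias in as w0
w_aug = np.array([bias, 1.0, 1.0])   # [w0, w1, w2]
x_aug = np.array([1.0, 12.0, 38.0])  # [x0, x1, x2]
augmented = np.dot(w_aug, x_aug)

print(original, augmented)  # 25.0 25.0 -- identical
```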
This technique (called "augmented feature space" or "homogeneous coordinates") provides several benefits: every term now has the same weight-times-feature shape, the bias is updated by the same gradient descent rule as every other weight, and the whole expression can be written as a single dot product.
Note: You might see this written compactly as w·x or wᵀx in machine learning papers—we'll explore what these notations mean when we dive into vectors and matrices in upcoming chapters.
💡 Practical Note: In practice, many machine learning libraries handle this automatically behind the scenes. When you specify a model with bias, the library is likely using this augmented feature representation internally. Now you know the trick they're using!
We've covered the core building blocks for classification:
A hyperplane (decision boundary) that separates categories (renew vs churn)
Cross-entropy, the loss function that measures prediction errors for categories (a small sketch follows below)
Gradient descent, which adjusts weights and bias to minimize loss on training data
With these pieces, we can build a classifier, train it on data, and watch it learn.
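We haven't worked through cross-entropy in detail yet, but here's a minimal sketch of the binary version, assuming the model outputs a probability of churn rather than a hard yes/no. The key behavior: confident wrong predictions are punished much harder than hesitant ones.

```python
import math

def binary_cross_entropy(actual, predicted_prob):
    # actual: 1 if the customer churned, 0 if they renewed
    # predicted_prob: the model's predicted probability of churn
    return -(actual * math.log(predicted_prob) + (1 - actual) * math.log(1 - predicted_prob))

print(binary_cross_entropy(1, 0.9))   # ~0.105 -- confident and right: small loss
print(binary_cross_entropy(1, 0.5))   # ~0.693 -- unsure: medium loss
print(binary_cross_entropy(1, 0.1))   # ~2.303 -- confident and wrong: large loss
```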
We just saw how a model can memorize training data (95% accuracy) but fail on new data (62% accuracy).
But here's the problem: You can't wait until April—when you deploy the model on real customers—to discover it doesn't work!
We need to know if the model generalizes BEFORE we deploy it.
How? By pretending some of our historical data is "new" and testing the model on it!
Instead of training on ALL 500 customers from January-March, we deliberately hold some back:
The model learns patterns from these customers
The model has NEVER seen these during training. We test on them to see if it can generalize.
The key insight: If the model performs well on the test set (which it never saw during training), we have evidence it learned patterns, not memorization. If it performs poorly, we know it overfitted BEFORE deploying to production!
Let's see this in action. Here are three different models trained on the same data. Watch how their performance on training vs test data reveals everything:
Each model trains on 400 customers, then we test all three on the held-out 100 customers they've never seen:
The Smoking Gun: Huge gap between training (98%) and test (58%). This model memorized specific customer details instead of learning general churn patterns.
Perfect! Similar performance on both sets (87% vs 84%). Small gap means it learned patterns that generalize. This is what we want!
Underfitted: Consistent but poor performance (63% vs 62%). Model too simple to capture churn patterns. Need more complexity!
To know if a model truly works, we need to test it on data it has never seen during training.
So we split the data into three sets:

| Set | Size | Purpose | Analogy |
|---|---|---|---|
| Training set | 70 customers | Used to learn weights via gradient descent | Studying with past exam questions |
| Validation set | 15 customers | Used to tune hyperparameters and prevent overfitting | Practice tests to check if we're ready |
| Test set | 15 customers | Used ONLY ONCE at the end to measure final performance | The actual exam we take once |
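Here's a minimal sketch of that split, assuming the customers live in a plain Python list (the 100-customer total and the 70/15/15 sizes follow the breakdown above):

```python
import random

customers = list(range(100))  # stand-ins for 100 customer records
random.shuffle(customers)     # shuffle so the split isn't biased by ordering

train_set = customers[:70]    # learn weights here
val_set   = customers[70:85]  # tune hyperparameters here
test_set  = customers[85:]    # touch ONLY ONCE, at the very end

print(len(train_set), len(val_set), len(test_set))  # 70 15 15
```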
See how we split our 8 customers into different sets. Click shuffle to randomize the split!
The test set represents future unseen data that the model has never encountered.
If we peek at test performance during training and adjust our model, we're essentially "teaching to the test" - the model will memorize patterns specific to the test set instead of learning general patterns.
The test set gives us an honest answer: "How will this model perform in the real world?"
Rule: Touch the test set ONLY ONCE at the very end, after all training and tuning is complete.
Adjust model complexity and training epochs to see overfitting happen in real-time
Model complexity refers to how flexible and powerful a model is. Think of it like drawing a straight line vs. a wiggly curve: a simple model can only draw the straight line, while a complex model can bend into a wiggly curve that traces every individual training point.
How to identify: Check the number of parameters (weights) - more parameters = more complex
An epoch is one complete pass through all the training data. Training for multiple epochs means the model sees the same data multiple times and keeps learning from it.
How to identify: Monitor when training accuracy keeps improving but test accuracy stops improving or gets worse
Moderate complexity with reasonable regularization. Should generalize well.
How to detect overfitting: Check if there's a large gap between training accuracy and test accuracy
What good generalization looks like: Training 87%, Test 84% → Small gap means the model learned real patterns
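A small sketch of that check. The 10-percentage-point gap and the 70% floor are illustrative thresholds chosen for this example, not universal rules:

```python
def diagnose(train_acc, test_acc, gap_threshold=0.10, low_threshold=0.70):
    gap = train_acc - test_acc
    if gap > gap_threshold:
        return f"Overfitting: large train/test gap ({gap:.0%})"
    if train_acc < low_threshold and test_acc < low_threshold:
        return "Underfitting: poor performance on both sets"
    return f"Good generalization: small gap ({gap:.0%})"

print(diagnose(0.98, 0.58))  # Overfitting: large train/test gap (40%)
print(diagnose(0.63, 0.62))  # Underfitting: poor performance on both sets
print(diagnose(0.87, 0.84))  # Good generalization: small gap (3%)
```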
Let's say you've built a model that shows good generalization:
Great! No overfitting. But now you need to actually use this model in production. Before deployment, leadership will want to know:
"89% accuracy sounds good, but what does that actually mean for our business?"
Knowing your model generalizes (doesn't overfit) is essential.
But to deploy it in production, you need to know what it's actually good at.
Let's look at a model that passed our overfitting test (small train/test gap). Now let's dig deeper into what kinds of predictions it's making.
Our Model's Accuracy: 95%
Wow! That sounds amazing, right?
Here's the surprising part: a completely useless model could also get 95% accuracy!
The Dumbest Possible Model:
```python
def predict(customer):
    return "RENEW"  # Always predict RENEW for everyone
```
This model always predicts "RENEW" for every customer. Let's see what happens:
Accuracy: 95 / 100 = 95%
The same 95% accuracy! But this model is completely useless—it never catches a single churner. We'd lose millions in revenue from customers we could have saved.
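You can verify this yourself in a few lines, assuming the 95 renew / 5 churn split described here:

```python
actual = ["RENEW"] * 95 + ["CHURN"] * 5   # 95 renewed, 5 churned
predicted = ["RENEW"] * 100               # the "dumbest possible model"

correct = sum(a == p for a, p in zip(actual, predicted))
print(f"Accuracy: {correct}/{len(actual)} = {correct / len(actual):.0%}")  # 95/100 = 95%

churners_caught = sum(a == "CHURN" and p == "CHURN" for a, p in zip(actual, predicted))
print(f"Churners caught: {churners_caught} out of 5")  # 0 -- completely useless
```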
Accuracy only tells us "how many did we get right overall", but what we actually need to know is: how many churners did we actually catch, and how many false alarms did we raise?
We need a better way to see what's really happening.
A confusion matrix is simply a table that compares what the model predicted versus what actually happened.
Think of it this way: You have 100 customers. For each one, two things happen:
"I think this customer will CHURN"
or
"I think this customer will RENEW"
Customer actually CHURNED
or
Customer actually RENEWED
The confusion matrix organizes all 100 customers into a 2×2 grid based on these two questions:
| | Model Predicted: CHURN | Model Predicted: RENEW |
|---|---|---|
| Actually CHURNED | Correct! (True Positive) | Wrong (False Negative) |
| Actually RENEWED | Wrong (False Positive) | Correct! (True Negative) |
That's it! The confusion matrix doesn't show complex math—it just counts how many customers fall into each of these four boxes. Once we have these counts, we can calculate metrics like Precision and Recall that tell us exactly what the model is good at.
Now let's see what this looks like with actual numbers. Instead of one number (accuracy), we'll look at four numbers that tell us exactly what's happening:
We organize these 4 outcomes into a simple 2×2 table. This makes it easy to see patterns:
Now we can see the full picture! The model got 93 predictions correct (90 + 3), but missed 2 out of 5 churners. That's the insight accuracy alone couldn't show us.
What it means: Out of 100 customers, we got the right prediction 93 times.
⚠️ Why it's misleading: We got 93% right overall, but missed 2 out of 5 churners! We lost valuable customers we could have saved.
What it means: When we predict churn, there's only a 37.5% chance they'll actually churn.
This matters when: Reaching out to customers is expensive or might annoy happy customers
What it means: Out of 5 customers who actually churned, we predicted 3 and missed 2.
This matters when: Missing a churner means losing a valuable customer we could have saved
We're comparing two churn prediction models. Both cost the same. We can only choose one.
Result: We contact way too many happy customers. They get annoyed and... actually churn!
Result: Efficient but we lose valuable customers we could have saved.
We're stuck. Each model is great at ONE thing but terrible at the other. We need one number that only gives high scores to balanced models.
What it means: One number that only gives high scores to models that do BOTH jobs well—catching churners AND avoiding false alarms.
This matters when: We need to compare models fairly. A model that predicts everyone will churn gets a terrible F1 score (too many false alarms). So does one that never predicts churn (misses everyone). F1 punishes one-sided models.
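Here's a minimal sketch that computes all four metrics from the confusion matrix counts we saw above (TP = 3, FN = 2, FP = 5, TN = 90):

```python
TP, FN, FP, TN = 3, 2, 5, 90  # counts from the confusion matrix above

accuracy  = (TP + TN) / (TP + TN + FP + FN)   # 0.93  -- looks great...
precision = TP / (TP + FP)                    # 0.375 -- but only 37.5% of churn alerts are real
recall    = TP / (TP + FN)                    # 0.60  -- and we only catch 3 of 5 churners
f1        = 2 * precision * recall / (precision + recall)  # ~0.46 -- the balanced view

print(f"Accuracy:  {accuracy:.1%}")
print(f"Precision: {precision:.1%}")
print(f"Recall:    {recall:.1%}")
print(f"F1 score:  {f1:.1%}")
```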
Different situations call for different priorities. The right metric depends on the business impact of each type of error:
The key insight: Accuracy alone isn't enough. Understanding what types of errors matter most for your specific problem is crucial. There's no "best" metric—only the right metric for your business priorities.
The concepts from this chapter give you the tools to make informed decisions when evaluating, testing, and deploying AI systems. Here's how:
An AI vendor shows you their customer service agent achieving "92% accuracy" in their demo.
Accuracy alone is misleading with imbalanced data. You need to see the confusion matrix.
Ask specific questions:
You're applying the same critical thinking from this chapter: Don't accept summary metrics. Understand what the model is actually good at.
You're about to deploy an AI agent that worked perfectly in the vendor's test environment.
Models can overfit to their training data. You must test on NEW data to detect this.
Test on YOUR data:
This is exactly the train/test split concept—vendor's demo is "training data," your conversations are "test data."
Your VP wants to know: "Is this AI agent worth the investment?"
Different metrics matter for different business priorities. Precision vs Recall represents a real trade-off.
Translate metrics to business impact:
You're using the precision/recall framework to frame business trade-offs leadership can understand.
Adjust the values to see how different prediction patterns affect accuracy, precision, recall, and F1-score
Linear classification works when categories can be separated by straight lines. But what happens when patterns are more complex—arranged in circles, nested regions, or intertwined in ways a single line can't capture?
And we've learned to detect overfitting, but how do we actively prevent it during training?
These questions lead us to neural networks, activation functions, and regularization techniques—the building blocks of modern deep learning. We'll explore these in upcoming chapters.
Core concepts from this chapter:
Regression predicts continuous numbers (house price: $450k). Classification predicts discrete categories (email: spam or not spam). Use classification when the answer is a label, not a number.
Classification models learn a decision boundary (like a line or curve) that separates different categories in the feature space. Points on one side get classified one way, points on the other side get classified differently.
Different decision boundaries can separate the same data reasonably well. Gradient descent might find different solutions depending on where it starts. This is normal in machine learning.
When classes are imbalanced (95 renewed, 5 churned), even a useless model can achieve high accuracy. The Confusion Matrix reveals what's really happening by showing True Positives, False Positives, True Negatives, and False Negatives.
With 2 features, the boundary is a line. With 3 features, it's a plane. With more features, it's a hyperplane. The math extends naturally to any number of dimensions.
Models can memorize training data (like memorizing "Customer #10234 will churn") instead of learning patterns (like "customers with low usage and high support tickets tend to churn"). This is called overfitting.
Split your data into training and test sets. A large gap between training accuracy (98%) and test accuracy (58%) means overfitting. A small gap (87% vs 84%) means good generalization.
Accuracy isn't enough for imbalanced data. Use Recall when missing positives is costly (churn, cancer screening). Use Precision when false alarms are expensive (fraud detection). Use F1 when you need balance (spam filters).
Key Decision:
Predicting numbers? Use Regression. Predicting categories? Use Classification.
Both use the same core concepts: weights, bias, gradient descent, and loss functions. Only the output type and loss function change.
Test what you've learned about classification!