From points to vectors: the foundation of modern AI
In the previous chapter, we learned how machines draw hyperplanes (decision boundaries) to separate customers into RENEW vs CHURN. We represented each customer with two numbers: months subscribed and usage hours. Customer A was (10, 35), Customer B was (12, 38), and so on.
There's a more general way to think about this data. Those customer coordinates aren't just "points" - they're vectors, and we have mathematical tools to compare them. Let's explore how.
In Chapter 4, we successfully classified customers as RENEW or CHURN using a decision boundary. But we never answered a fundamental question:
To answer this, we need mathematical tools to compare multi-dimensional data. That's what this chapter covers.
This chapter introduces two fundamental tools used in search engines, recommendation systems, and language models: the dot product and cosine similarity. We'll learn when to use each and how AI systems use them to find patterns.
First, let's understand what a scalar is. It's just a single number - a magnitude without direction.
Just a number - how hot or cold
One number representing cost
A single value from our customer data
A single learned parameter
What if we want to represent something that has multiple dimensions? Like Customer A who has BOTH 10 months subscribed AND 35 hours usage? That's where vectors come in!
A vector is an ordered list of numbers that represents something in multi-dimensional space. In machine learning, everything—customers, products, text, images—gets represented as vectors.
This is the RENEW customer from our scatter plot!
This is the CHURN customer from our scatter plot!
Every customer in our classification problem is a vector! When we see scatter plots showing data points in 2D space, those are visualizations of vectors. Machine learning is really about finding patterns in high-dimensional vector spaces.
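If you'd like to see this in code, here's a minimal sketch using Python with NumPy (our choice for illustration; the chapter's interactive widgets don't require it) that writes two of the customers above as vectors:

```python
import numpy as np

# Two customers from the scatter plot, written as vectors:
# [months subscribed, usage hours], the features from the previous chapter.
customer_a = np.array([10, 35])   # RENEW
customer_b = np.array([12, 38])   # RENEW

# A whole dataset is just a stack of vectors: one row per customer.
customers = np.stack([customer_a, customer_b])
print(customers.shape)            # (2, 2): 2 customers, 2 features each
```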
Click anywhere in the canvas to create vectors and see them come to life!
Try this: Create several vectors and notice how they're all arrows pointing from the origin (center). Longer arrows = larger magnitude, different directions = different angles!
Every vector has two fundamental properties. Think of a vector as an arrow:
What it means: How long is the arrow? This tells us the "strength" or "intensity" of the customer's behavior.
Magnitude is the length of the vector arrow from origin to the endpoint
How we calculate it: Using the Pythagorean theorem (like finding the hypotenuse of a triangle):
Example: Customer A = [10, 35]
||A|| = √(10² + 35²) = √(100 + 1,225) = √1,325 ≈ 36.40
Example: Customer E = [8, 2]
||E|| = √(8² + 2²) = √(64 + 4) = √68 ≈ 8.25
Interpretation: Customer A has a magnitude of 36.40 while Customer E has only 8.25. This means Customer A shows much stronger engagement overall.
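Here's a quick way to check these magnitudes in code, a small NumPy sketch using the same two customers (NumPy's `norm` function computes exactly the Pythagorean formula above):

```python
import numpy as np

# Checking the magnitudes worked out above.
a = np.array([10, 35])            # Customer A
e = np.array([8, 2])              # Customer E

print(np.linalg.norm(a))          # ≈ 36.40 = √(10² + 35²)
print(np.linalg.norm(e))          # ≈ 8.25  = √(8² + 2²)
print(np.sqrt(10**2 + 35**2))     # the same Pythagorean calculation by hand
```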
What it means: Which way is the arrow pointing? This tells us the "type" or "pattern" of behavior.
Direction shows where vectors point — A & B are similar, E is different
Why it matters: Two customers might have similar strength (magnitude) but completely different patterns (direction).
Example: Customer A [10, 35] and Customer B [12, 38] point in nearly the same direction — both have high usage relative to support tickets (both RENEW). But Customer A [10, 35] and Customer E [8, 2] point in very different directions, indicating different behavioral patterns!
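One concrete way to see direction is to measure each arrow's angle from the horizontal axis. A small sketch (the `angle_deg` helper is our own, written just for this illustration):

```python
import numpy as np

# A vector's direction, measured as the angle of its arrow from the x-axis.
def angle_deg(v):
    return np.degrees(np.arctan2(v[1], v[0]))

print(angle_deg(np.array([10, 35])))   # Customer A: ~74 degrees
print(angle_deg(np.array([12, 38])))   # Customer B: ~72 degrees (nearly parallel to A)
print(angle_deg(np.array([8, 2])))     # Customer E: ~14 degrees (very different direction)
```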
Let's look at a situation that reveals something important about comparing vectors.
[10, 35]
10 support tickets, 35 hrs/week usage
Outcome: RENEW
[12, 38]
12 support tickets, 38 hrs/week usage
Outcome: RENEW
[8, 2]
8 support tickets, 2 hrs/week usage
Outcome: CHURN
[5, 1]
5 support tickets, 1 hr/week usage
Outcome: CHURN
Look at these two customers:
You'd probably say: "These are very similar customers." Both are power users with lots of activity.
Now look at these:
These look different, right? Customer E files more support tickets and logs twice as many usage hours as Customer F.
But wait—let's look at their behavior, not just the numbers.
Customer E's behavior:
Customer F's behavior:
The insight: E and F have the same type of relationship with the product. They both struggle, filing lots of support tickets relative to how much they actually use it.
The only difference? Customer E's numbers are larger overall. But their behavior pattern is nearly identical.
This is the key insight: Sometimes we want to find similar patterns, regardless of scale. That's when cosine similarity shines—it ignores the magnitude and focuses purely on the ratio between dimensions.
It depends on what we're trying to find. Let's look at two different business questions:
The goal: Find customers similar to Customer A who are high-value power users to invite to an exclusive beta program.
Customer A: [10 support tickets, 35 usage hours] ← Reference
Who should we invite?
Use Dot Product:
Why dot product works here: We want someone who is BOTH a heavy user AND similar to Customer A. Dot product rewards high magnitude—it finds power users.
Your goal: Customer E is struggling (lots of support tickets for very little usage). You want to find others with the same struggle pattern to send them a tutorial, regardless of how active they are.
Customer E: [8 support tickets, 2 usage hours] ← Your reference (4:1 ratio = struggling)
Who else is struggling the same way?
Use Cosine Similarity:
Why cosine similarity works here: You don't care if someone is super active or barely active. You only care if they have the same behavioral pattern—same ratio of support tickets to usage hours.
The Key Difference:
Now that we understand why we need two tools, let's see what each one does.
Measures: Direction AND magnitude together
Use when: Scale matters. You want to know if vectors align AND have strong signals.
Measures: Direction ONLY (ignores magnitude)
Use when: You want patterns regardless of scale. Small and large vectors with the same direction are equally similar.
Both operations power modern AI—from search engines to ChatGPT. Let's learn how each one works.
Each customer is an arrow (vector) starting from the origin. Similar customers have arrows pointing in the same direction.
Nearly parallel arrows
Both RENEW
Large angle between arrows
A: RENEW, E: CHURN
The key insight: Stand at the origin and look along each arrow. If two arrows point the same way, the customers are similar!
The dot product is a fundamental way to compare two vectors. It combines information about both their direction (alignment) and their magnitude (strength). The recipe is simple: multiply matching numbers, then add them all up.
Pair up the numbers, multiply each pair, add the results
Vector a = [3, 4]
Vector b = [2, 5]
First components: 3 × 2 = 6
Second components: 4 × 5 = 20
a · b = 6 + 20 = 26
The dot product combines two things: how aligned the vectors are (direction) and how large they are (magnitude).
Key insight: The dot product value depends on BOTH the angle between vectors AND their lengths. Two vectors perfectly aligned (0° angle) will have a larger dot product if they're longer.
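Here's the same recipe as a short sketch, written out by hand and then with NumPy's built-in `np.dot`:

```python
import numpy as np

a = np.array([3, 4])
b = np.array([2, 5])

# The recipe by hand: multiply matching components, then add.
by_hand = a[0] * b[0] + a[1] * b[1]    # 3*2 + 4*5 = 26
print(by_hand)

# NumPy's built-in dot product gives the same answer.
print(np.dot(a, b))                    # 26
```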
Enter two vectors and watch the dot product calculation unfold step by step!
Cosine similarity measures only the direction (angle) between vectors, completely ignoring their magnitude. It answers the question: "Are these vectors pointing in the same direction?"
The key idea: normalize vectors to length 1 first, then take the dot product. This removes magnitude from the equation, leaving only directional information.
When all vectors have the same length, the only difference between their dot products comes from their direction (angle). Size no longer affects the result—we're purely comparing which way they point.
Same direction, different lengths
Same length, now only direction matters!
Key idea: After normalization, both arrows have length 1. Any difference in their dot product now comes only from their direction, not their size.
The dot product
Measures alignment
Product of magnitudes
Removes the effect of size
Imagine a teacher grading how well students point in the right direction when asked "Where is north?"
Student A:
Points north with a 1.5 foot arm
Student B:
Also points north with a 2.5 foot arm
The question: Should Student B get a better grade just because they have a longer arm?
No! They both point in the same direction. Arm length shouldn't matter.
The cosine similarity formula divides the dot product by the magnitudes (||a|| × ||b||). This is mathematically equivalent to converting both vectors to unit vectors—vectors with length exactly 1—and then comparing them.
Why this doesn't change the vector's meaning:
A vector has two properties: direction (which way it points) and magnitude (how long it is). When we normalize to a unit vector, we're scaling it down (or up) to length 1, but the direction stays exactly the same.
Think of it like this: An arrow pointing northeast is still pointing northeast whether it's 1 inch long or 10 feet long. The direction is preserved—only the scale changes.
How to normalize: Divide each component by the magnitude
For v = [3, 4] (so ||v|| = 5): û = v / ||v|| = [3/5, 4/5] = [0.6, 0.8]
✓ Same direction (northeast)
✓ Now length = 1
By dividing by (||a|| × ||b||), the formula asks: "If both vectors had length 1, what would their dot product be?"
This removes magnitude from the equation, leaving only direction. It's like making all students' arms the same length before judging which way they point. Now we can fairly compare: Do they point the same way?
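In code, the normalization step is a one-liner. A minimal sketch using the [3, 4] example above:

```python
import numpy as np

# Normalize: divide each component by the magnitude (here ||v|| = 5).
v = np.array([3, 4])
unit_v = v / np.linalg.norm(v)

print(unit_v)                  # [0.6 0.8] -> same direction as v
print(np.linalg.norm(unit_v))  # 1.0 (up to floating-point rounding) -> length exactly 1
```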
Note: For customer data like ours, where all feature values are non-negative, cosine similarity always falls between 0 and 1. Negative values only appear if features can be negative.
The formula doesn't just happen to be called "cosine similarity" — it literally computes the cosine of the angle between the two vectors.
Same direction
Perfect similarity
Perpendicular
No similarity
Very small angle
High similarity
Opposite direction
Inverse relationship
Cosine Similarity = cos(angle between vectors)
The name isn't random — the formula literally computes the cosine of the geometric angle between arrows in vector space. This is why it perfectly measures direction while ignoring magnitude.
Vector a = [3, 4]
Vector b = [6, 8]
(Note: b is exactly 2× larger than a, same direction)
a · b = (3×6) + (4×8) = 18 + 32 = 50
||a|| = √(3² + 4²) = √(9 + 16) = √25 = 5
||b|| = √(6² + 8²) = √(36 + 64) = √100 = 10
||a|| × ||b|| = 5 × 10 = 50
Vectors a and b have cosine similarity of 1.0 (perfect similarity). Even though b is twice as large, they point in exactly the same direction!
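Here's a small sketch that reproduces this calculation. The `cosine_similarity` helper is our own function written directly from the formula (it isn't a library call), and the `np.clip` guards against tiny floating-point overshoot before taking the arccosine:

```python
import numpy as np

def cosine_similarity(a, b):
    # (a · b) / (||a|| × ||b||), exactly the formula above
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([3, 4])
b = np.array([6, 8])

sim = cosine_similarity(a, b)
print(sim)                                             # 1.0
# The angle behind that score: arccos(1.0) = 0 degrees.
print(np.degrees(np.arccos(np.clip(sim, -1.0, 1.0))))  # 0.0
```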
Two customers are similar when their vectors (arrows) point in the same direction in feature space.
Multiply matching numbers, add them up. Captures BOTH alignment and size—perfect when both matter!
a · b = (a₁ × b₁) + (a₂ × b₂) + ...
Normalize by dividing by magnitudes. Removes size, isolates direction.
cosine_similarity = (a · b) / (||a|| × ||b||)
Cosine similarity measures the angle between vectors: +1 means same direction, 0 means perpendicular, and -1 means opposite directions.
Adjust the vectors with sliders and watch how the angle and cosine similarity change in real-time!
Let's see cosine similarity applied to a concrete problem. Dot product applications are clearest in neural networks (attention mechanisms, learned weights), which we'll explore in later chapters.
Predict if customers will RENEW or CHURN based on their behavior: [support_tickets, usage_hours]. Different customers have different engagement levels (some use the product a lot, some don't).
Question: Should you use dot product or cosine similarity?
Let's compare what dot product and cosine similarity tell us about customer pairs:
| Customer Pair | Dot Product (direction × magnitude) | Cosine Similarity (direction only) | Actual Outcome |
|---|---|---|---|
| A · B | 1,450 | High | Both RENEW |
| A · C | 1,260 | High | Both RENEW |
| E · F | 42 | High | Both CHURN |
| A · E | 150 | Low | A renews, E churns |
Look at E · F: Their dot product is small (42) because both have low overall engagement (small vectors). But when we normalize by dividing by magnitudes, we remove the scale difference and focus purely on direction. Result: high cosine similarity—similar behavioral pattern.
E and F both churn not because they have low engagement, but because they follow a similar behavioral ratio (support tickets vs. usage hours). A machine learning classifier learns to recognize this pattern as predictive of churn. Cosine similarity lets us find customers with similar patterns regardless of their activity level.
Meanwhile, A and E have low cosine similarity—they point in quite different directions. A renews, E churns. They follow different behavioral patterns, which cosine similarity correctly identifies.
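If you want to reproduce these numbers yourself, here's a small sketch using the customers from the cards above (Customer C isn't listed in this section, so it's left out):

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

A = np.array([10, 35])   # RENEW
B = np.array([12, 38])   # RENEW
E = np.array([8, 2])     # CHURN
F = np.array([5, 1])     # CHURN

for name, (x, y) in {"A·B": (A, B), "E·F": (E, F), "A·E": (A, E)}.items():
    print(name, int(np.dot(x, y)), round(cosine_similarity(x, y), 2))

# A·B 1450 1.0 -> similar pattern, big vectors
# E·F 42 1.0   -> similar pattern, small vectors (cosine ignores the size)
# A·E 150 0.5  -> different pattern, flagged by the low cosine
```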
| Cosine Similarity | Label | Interpretation |
|---|---|---|
| 0.90–1.00 | High | Same pattern |
| 0.70–0.89 | Moderate | Somewhat similar |
| <0.70 | Low | Different pattern |
Important: These ranges are general guidelines only. Optimal thresholds vary significantly by domain, data distribution, and task requirements. Always validate with your specific data using metrics like precision-recall curves.
Use cosine similarity when grouping by behavioral pattern, ignoring differences in scale or intensity.
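If you wanted to turn the guideline table into code, a tiny sketch might look like this. The cut-offs are the rough ranges above and, as noted, should be tuned and validated on your own data:

```python
def similarity_label(score, high=0.90, moderate=0.70):
    # Cut-offs from the guideline table above; tune them for your own data.
    if score >= high:
        return "High: same pattern"
    if score >= moderate:
        return "Moderate: somewhat similar"
    return "Low: different pattern"

print(similarity_label(0.97))   # High: same pattern
print(similarity_label(0.50))   # Low: different pattern
```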
Throughout Chapters 4 and 5, we've been plotting our customers on a 2D graph. Let's take a step back and really look at what we've created. Something profound is happening in this simple plot.
This scatter plot isn't just a convenient visualization — it represents a vector space, a mathematical structure where:
Key Insight: Clusters form automatically because customers with similar behaviors have similar vectors, which places them close together in this space. The mathematical structure reflects real-world patterns!
We've been plotting customers as points in a 2D space. Each customer is a vector [months, hours], and each vector becomes a point we can plot. Similar customers cluster together because their vectors are similar. This geometric space where vectors live is called a "vector space."
The key insight: every vector is a point, and every point is a vector. The geometry of the space — distances, angles, clusters — captures real relationships in the data.
Our customer example uses 2 dimensions (months and hours). But vector spaces work with any number of dimensions — and the same mathematical principles apply to all of them:
Just a number line. One feature only.
Example: Just "months subscribed"
A plane. Two features.
Example: Our customer space!
Physical space. Three features.
Example: Add "support tickets"
Can't visualize, but same rules!
Example: 10+ customer features
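The payoff of "same rules in any dimension" is that the exact same code runs unchanged whether a vector has 2 components or 100. A small sketch (the 100-dimensional vector is random, made-up data purely for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# 2 dimensions: our customers.
print(cosine_similarity(np.array([10, 35]), np.array([12, 38])))   # ~1.0

# 100 dimensions: same formula, same code.
rng = np.random.default_rng(0)
u = rng.random(100)                 # a random 100-dimensional "customer"
print(cosine_similarity(u, 2 * u))  # 1.0, since doubling still doesn't change direction
```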
Whether dealing with 2 dimensions or 100 dimensions, vector spaces follow two simple mathematical rules. These rules are what make all the operations possible:
This is called "closure under addition" - the result always stays in the same space.
Why this matters: When we combine customer patterns, we get meaningful results. The sum represents "combined behavior" that makes sense in our feature space.
This is called "closure under scalar multiplication" - stretching or shrinking always stays in the space.
Why this matters: We can scale patterns up or down. Doubling the vector represents "twice as intense" of that behavior pattern.
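Both properties are easy to see in code. A minimal sketch with our two RENEW customers:

```python
import numpy as np

# Closure in action: the results are always another [months, hours] vector.
a = np.array([10, 35])   # Customer A
b = np.array([12, 38])   # Customer B

print(a + b)   # [22 73] -> combining two behavior patterns stays in the space
print(2 * a)   # [20 70] -> a "twice as intense" version of the same pattern
```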
These two properties might seem simple, but they enable:
Looking ahead: In Chapter 7 (Embeddings), we'll see these exact same rules applied to word meanings instead of customer features. The mathematical principles are identical - only what the vectors represent changes!
We just learned that vector spaces can have 2 dimensions, 3 dimensions, or many more dimensions. But here's the question we haven't answered yet: What are these dimensions actually measuring?
In our customer example, we chose 2 specific things to measure: months subscribed and usage hours. Those became our 2 dimensions. But we could have chosen different measurements. Let's understand this crucial concept.
Every dimension (axis) in our vector space represents one measurable trait or feature. Let's start with our simple example and build up to understand how machine learning models use these features.
This dimension captures loyalty/tenure
This dimension captures engagement level
These 2 dimensions give us a 2D snapshot of customer behavior. But what if we added more?
We've covered vectors—how they represent data as points in space, how dot products and cosine similarity measure relationships, and how vector spaces work in any dimension.
But what about processing thousands of customer vectors at once? Or transforming an entire dataset from one vector space to another? That's where matrices come in.
We'll see how matrices let us operate on many vectors simultaneously, how neural networks use matrix multiplication to process entire batches of data, and why GPUs are essential for modern AI. The vector operations covered here become the building blocks for understanding how deep learning computes.
Core concepts from this chapter:
Dot product combines direction and magnitude. Cosine similarity isolates direction only. Choose based on your problem.
Dot product: multiply matching elements, add them up. Cosine similarity: dot product ÷ magnitudes.
Classification models use dot product for decision boundaries and cosine similarity for comparing feature patterns - understanding both is essential.
Dot product: larger when vectors align AND are large. Cosine similarity: ignores size, measures angle between vectors.
Vectors aren't just numbers - they're points in space. Similar vectors cluster together, and this geometry naturally reveals patterns in data.
Each dimension represents one measurable trait. Our 2D customer space has 2 traits (months subscribed, usage hours). Real systems add more features. Same math, more detail.
We can't visualize 10D or 100D customer spaces, but the mathematics is identical. Distance, angles, and clusters work exactly the same in any dimension.
The Foundation:
Vectors represent data as points in space. Dot product measures alignment and magnitude. Cosine similarity measures pure direction.
These operations are used in search engines, recommendation systems, and language models. Understanding these fundamentals helps make sense of more complex deep learning concepts.
Test what we've learned about vectors and deep learning!