From points to vectors: the foundation of modern AI
In the previous chapter, we learned how machines draw hyperplanes (decision boundaries) to separate customers into RENEW vs CHURN. We represented each customer with two numbers: months subscribed and usage hours. Customer A was (10, 35), Customer B was (12, 38), and so on.
But there's a more powerful way to think about this data - one that unlocks the entire field of modern AI. Those customer coordinates aren't just "points" - they're vectors, and we have powerful mathematical tools to compare them. Let's dive in.
First, let's understand what a scalar is. It's just a single number - a magnitude without direction.
Temperature: just a number describing how hot or cold
Price: one number representing cost
Months subscribed: a single value from our customer data
A model weight: a single learned parameter
What if we want to represent something that has multiple dimensions? Like Customer A who has BOTH 10 months subscribed AND 35 hours usage? That's where vectors come in!
A vector is an ordered list of numbers that represents something in multi-dimensional space. Remember our customers from Chapter 3? They were vectors all along!
This is the RENEW customer from our scatter plot!
This is the CHURN customer from our scatter plot!
Every customer in our classification problem is a vector! The scatter plot from Chapter 3? That's a visualization of vectors in 2D space. Machine learning is really about finding patterns in high-dimensional vector spaces.
Click anywhere in the canvas to create vectors and see them come to life!
💡 Try this: Create several vectors and notice how they're all arrows pointing from the origin (center). Longer arrows = larger magnitude, different directions = different angles!
Every vector has two fundamental properties. Think of them like an arrow:
What it means: How long is the arrow? This tells us the "strength" or "intensity" of the customer's behavior.
Magnitude is the length of the vector arrow from the origin to its endpoint
How we calculate it: Using the Pythagorean theorem (like finding the hypotenuse of a triangle):
Example: Customer A = [10, 35]
||A|| = √(10² + 35²) = √(100 + 1,225) = √1,325 ≈ 36.40
Example: Customer E = [8, 2]
||E|| = √(8² + 2²) = √(64 + 4) = √68 ≈ 8.25
Interpretation: Customer A has a magnitude of 36.40 while Customer E has only 8.25. This means Customer A shows much stronger engagement overall.
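If you want to compute this yourself, here's a minimal Python sketch of the same calculation (the `magnitude` helper is just an illustrative name):

```python
import math

# Magnitude (length) of a vector via the Pythagorean theorem.
def magnitude(v):
    return math.sqrt(sum(x * x for x in v))

print(magnitude([10, 35]))  # ≈ 36.40 (Customer A)
print(magnitude([8, 2]))    # ≈ 8.25  (Customer E)
```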
What it means: Which way is the arrow pointing? This tells us the "type" or "pattern" of behavior.
Direction shows where vectors point — A & B are similar, E is different
Why it matters: Two customers might have similar strength (magnitude) but completely different patterns (direction).
Example: Customer A [10, 35] and Customer B [12, 38] point in nearly the same direction: both have high usage hours relative to their months subscribed (both RENEW). But Customer A [10, 35] and Customer E [8, 2] point in different directions, indicating different behavioral patterns!
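A quick way to see this numerically is to compute each customer's direction as an angle from the months-subscribed axis. A small Python sketch, using the customer vectors from this chapter (helper names are purely illustrative):

```python
import math

# Direction of each customer vector, measured as the angle (in degrees)
# from the months-subscribed axis.
customers = {"A": (10, 35), "B": (12, 38), "E": (8, 2)}

for name, (months, hours) in customers.items():
    angle = math.degrees(math.atan2(hours, months))
    print(f"{name}: {angle:.1f} degrees")

# A and B point in nearly the same direction (~74° and ~72°);
# E points somewhere quite different (~14°).
```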
Now that we understand our customers are vectors, we need ways to compare them. But "compare" can mean different things depending on what you're trying to measure:
Sometimes you want to know both the direction AND the magnitude. Like in attention mechanisms: "Which words are pointing in a relevant direction AND have strong signals?"
Sometimes you only care about the direction, ignoring scale. Like our customer problem: "Do these customers follow the same behavioral pattern?"
Both operations power AI systems—recommendation engines, search, chatbots, and classification. Let's understand each tool and when to use it.
Each customer is an arrow (vector) starting from the origin. Similar customers have arrows pointing in the same direction.
A & B: nearly parallel arrows (both RENEW)
A & E: large angle between the arrows (A: RENEW, E: CHURN)
The key insight: Stand at the origin and look along each arrow. If two arrows point the same way, the customers are similar!
The dot product is a fundamental way to compare two vectors. It combines information about both their direction (alignment) and their magnitude (strength). The recipe is simple: multiply matching numbers, then add them all up.
Pair up the numbers, multiply each pair, add the results
Vector a = [3, 4]
Vector b = [2, 5]
First components: 3 × 2 = 6
Second components: 4 × 5 = 20
a · b = 6 + 20 = 26
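In code, the recipe really is that short. A minimal Python sketch (the `dot` helper is an illustrative name, not a library function):

```python
# Dot product: multiply matching components, then add the results.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

print(dot([3, 4], [2, 5]))  # (3 × 2) + (4 × 5) = 6 + 20 = 26
```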
The dot product combines two things: how aligned the vectors are (direction) and how large they are (magnitude).
Key insight: The dot product value depends on BOTH the angle between vectors AND their lengths. Two vectors perfectly aligned (0° angle) will have a larger dot product if they're longer.
Enter two vectors and watch the dot product calculation unfold step by step!
Cosine similarity measures only the direction (angle) between vectors, completely ignoring their magnitude. It answers the question: "Are these vectors pointing in the same direction?"
The key idea: normalize vectors to length 1 first, then take the dot product. This removes magnitude from the equation, leaving only directional information.
When all vectors have the same length, the only difference between their dot products comes from their direction (angle). Size no longer affects the result—you're purely comparing which way they point.
Before normalization: same direction, different lengths
After normalization: ✓ same length, now only direction matters!
Key idea: After normalization, both arrows have length 1. Any difference in their dot product now comes only from their direction, not their size.
In the formula cosine_similarity = (a · b) / (||a|| × ||b||), the numerator (the dot product) measures alignment, while the denominator (the product of magnitudes) removes the effect of size.
Imagine you're a teacher grading how well students point in the right direction when asked "Where is north?"
Student A:
Points north with a 1.5 foot arm
Student B:
Also points north with a 2.5 foot arm
The question: Should Student B get a better grade just because they have a longer arm?
No! They both point in the same direction. Arm length shouldn't matter.
A unit vector is a vector that has been "normalized" to have a length (magnitude) of exactly 1. It keeps the same direction, but we standardize the length.
How? Divide each component by the magnitude:
For v = [3, 4], the magnitude is ||v|| = 5, so û = v / ||v|| = [3/5, 4/5] = [0.6, 0.8]
Same direction, but now length = 1
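Here's what normalization looks like as a small Python sketch (the `normalize` helper is illustrative):

```python
import math

# Normalize a vector to a unit vector: divide each component by the magnitude.
def normalize(v):
    mag = math.sqrt(sum(x * x for x in v))
    return [x / mag for x in v]

u = normalize([3, 4])
print(u)                                 # [0.6, 0.8]
print(math.sqrt(sum(x * x for x in u)))  # 1.0 — same direction, length 1
```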
By dividing the dot product by the magnitudes, we're essentially converting both vectors to unit vectors and then asking: "If both vectors had length 1, what would their dot product be?"
This removes magnitude, leaving only direction. We're normalizing both vectors to unit vectors, so we can compare their directions fairly — just like making all students' arms the same length before judging which way they point!
Note: For customer data (where all values are positive), cosine similarity is typically between 0 and 1. Negative values only appear if features can be negative.
The formula doesn't just happen to be called "cosine similarity" — it literally computes the cosine of the angle between the two vectors.
| Angle between vectors | Cosine similarity | Interpretation |
|---|---|---|
| 0° (same direction) | 1.0 | Perfect similarity |
| Very small angle | Close to 1.0 | High similarity |
| 90° (perpendicular) | 0.0 | No similarity |
| 180° (opposite direction) | -1.0 | Inverse relationship |
Cosine Similarity = cos(angle between vectors)
The name isn't random — the formula literally computes the cosine of the geometric angle between arrows in vector space. This is why it perfectly measures direction while ignoring magnitude.
Vector a = [3, 4]
Vector b = [6, 8]
(Note: b is exactly 2× larger than a, same direction)
a · b = (3×6) + (4×8) = 18 + 32 = 50
||a|| = √(3² + 4²) = √(9 + 16) = √25 = 5
||b|| = √(6² + 8²) = √(36 + 64) = √100 = 10
||a|| × ||b|| = 5 × 10 = 50
Vectors a and b have cosine similarity of 1.0 (perfect similarity). Even though b is twice as large, they point in exactly the same direction!
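Putting the pieces together, a minimal Python sketch of the full calculation (illustrative helper names):

```python
import math

# Cosine similarity: dot product divided by the product of the magnitudes.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x * x for x in a))
    mag_b = math.sqrt(sum(x * x for x in b))
    return dot / (mag_a * mag_b)

print(cosine_similarity([3, 4], [6, 8]))  # 1.0 — same direction, different lengths
```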
Two customers are similar when their vectors (arrows) point in the same direction in feature space.
Dot product: multiply the matching numbers, then add them up. Captures BOTH alignment and size, perfect when both matter!
a · b = (a₁ × b₁) + (a₂ × b₂) + ...
Cosine similarity: divide the dot product by the magnitudes. This removes size and isolates direction, measuring only the angle between the vectors.
cosine_similarity = (a · b) / (||a|| × ||b||)
Adjust the vectors with sliders and watch how the angle and cosine similarity change in real-time!
Let's see cosine similarity applied to a concrete problem. Dot product applications are most clear in neural networks (attention mechanisms, learned weights), which we'll explore in later chapters.
Predict if customers will RENEW or CHURN based on their behavior: [months_subscribed, usage_hours]. Different customers have different engagement levels (some use the product a lot, some don't).
Question: Should you use dot product or cosine similarity?
Let's compare what dot product and cosine similarity tell us about customer pairs:
| Customer Pair | Dot Product (direction × magnitude) | Cosine Similarity (direction only) | Actual Outcome |
|---|---|---|---|
| A · B | 1,450 | High | Both RENEW ✓ |
| A · C | 1,260 | High | Both RENEW ✓ |
| E · F | 42 | High | Both CHURN ✓ |
| A · E | 150 | Low | A renews, E churns |
Look at E · F: Their dot product is small (42) because both have low overall engagement (small vectors). But when we normalize by dividing by magnitudes, we remove the scale difference and focus purely on direction. Result: high cosine similarity—similar behavioral pattern.
E and F both churn not because they have low engagement, but because they follow a similar behavioral ratio (usage hours relative to months subscribed). A machine learning classifier learns to recognize this pattern as predictive of churn. Cosine similarity lets us find customers with similar patterns regardless of their activity level.
Meanwhile, A · E have low cosine similarity—very different directions. A renews, E churns. They follow different behavioral patterns, which cosine similarity correctly identifies.
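If you'd like to reproduce the A · B and A · E rows above, here's a small sketch using the customer vectors from this chapter (Customer F isn't spelled out in the text, so it's omitted):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    return dot(a, b) / (math.hypot(*a) * math.hypot(*b))

# Customer vectors: [months_subscribed, usage_hours]
A, B, E = [10, 35], [12, 38], [8, 2]

print(dot(A, B), round(cosine_similarity(A, B), 3))  # 1450, ~1.0  -> same pattern, both RENEW
print(dot(A, E), round(cosine_similarity(A, E), 3))  # 150,  ~0.5  -> different patterns
```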
| Cosine Similarity | Label | Interpretation |
|---|---|---|
| 0.90–1.00 | High | Same pattern |
| 0.70–0.89 | Moderate | Somewhat similar |
| <0.70 | Low | Different pattern |
Important: These ranges are general guidelines only. Optimal thresholds vary significantly by domain, data distribution, and task requirements. Always validate with your specific data using metrics like precision-recall curves.
Use cosine similarity when you want to group by behavioral pattern, ignoring differences in scale or intensity.
Throughout Chapters 3 and 4, we've been plotting our customers on a 2D graph. Let's take a step back and really look at what we've created. Something profound is happening in this simple plot.
This scatter plot isn't just a convenient visualization — it represents a vector space, a mathematical structure where position, distance, and angle all carry meaning.
Key Insight: Clusters form automatically because customers with similar behaviors have similar vectors, which places them close together in this space. The mathematical structure reflects real-world patterns!
A vector space is a mathematical structure where:
Customer A = [10, 35] lives at coordinates (10, 35)
The point (10, 35) represents the vector [10, 35]
Distance = how different; Angle = how similar
1D: Just a number line. One feature only. Example: just "months subscribed"
2D: A plane. Two features. Example: our customer space!
3D: Physical space. Three features. Example: add "support tickets"
768D: Can't visualize, but same rules! Example: 768D word embeddings
Our brains can only visualize 2D or 3D, but the mathematics works in ANY dimension. The distance formula, angles, clusters - all of it works exactly the same way in 768 dimensions as it does in our 2D customer plot!
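To make that concrete, here's a small sketch running the exact same cosine-similarity code on 768-dimensional vectors. The random vectors are just stand-ins for real embeddings:

```python
import math
import random

# The same cosine-similarity formula works in any number of dimensions.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x * x for x in a))
    mag_b = math.sqrt(sum(x * x for x in b))
    return dot / (mag_a * mag_b)

random.seed(0)
a = [random.gauss(0, 1) for _ in range(768)]
b = [random.gauss(0, 1) for _ in range(768)]

print(cosine_similarity(a, b))  # close to 0: random high-dimensional vectors
                                # are nearly perpendicular to each other
```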
Every dimension (axis) in our vector space represents one measurable trait or feature. Let's start with our simple example and build up to understand how LLMs work.
Dimension 1 (months subscribed): captures loyalty/tenure
Dimension 2 (usage hours): captures engagement level
These 2 dimensions give us a 2D snapshot of customer behavior. But what if we added more?
In word embeddings, we don't manually choose what each dimension represents. Instead, the model learns which dimensions capture meaningful patterns. But we can interpret some of them:
Formality: "hey" → -0.8, "greetings" → +0.7
Sentiment: "terrible" → -0.9, "wonderful" → +0.9
Gender association: "queen" → -0.6, "king" → +0.6
Time/tense: "was" → -0.7, "will" → +0.7
Plus 764 more dimensions, each capturing its own subtle patterns.
768 dimensions isn't scary - it's just 768 different traits measured at once! Each dimension is like one column in a spreadsheet. More columns = more detailed description.
Our customers had 2 traits (months, usage). Words have 768 traits (formality, sentiment, gender, tense, abstractness, and 763 more). That's it!
Look back at our scatter plot. The RENEW and CHURN customers naturally form clusters. This isn't a coincidence - it's geometry revealing underlying patterns.
The "average" customer in the cluster
RENEW Centroid:
Average of A, B, C, D = [(10+12+14+11)/4, (35+38+32+33)/4]
= [11.75, 34.5]
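The same average in code, as a minimal sketch using customers A through D:

```python
# Centroid of the RENEW cluster: the component-wise average of its vectors.
renew = [[10, 35], [12, 38], [14, 32], [11, 33]]  # customers A, B, C, D

centroid = [sum(col) / len(renew) for col in zip(*renew)]
print(centroid)  # [11.75, 34.5]
```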
Variance: how spread out are the points?
RENEW cluster: Low variance (tight cluster)
Points are all similar to each other
Separation: how far apart are the clusters?
RENEW centroid to CHURN centroid: Large!
This is why classification works!
In our 2D space, we can see the clusters visually. But the same concept applies in 768 dimensions! Word embeddings cluster by:
All point in similar directions in 768D space
Different cluster, different region of space
Yet another cluster for abstract concepts
Just like our RENEW and CHURN customers cluster because they share behavioral patterns, words cluster because they share semantic patterns!
Core concepts from this chapter:
Dot product combines direction and magnitude. Cosine similarity isolates direction only. Choose based on your problem.
Dot product: multiply matching elements, add them up. Cosine similarity: dot product ÷ product of magnitudes.
LLMs use dot product in attention and cosine similarity for embeddings - understanding both is essential.
Dot product: larger when vectors align AND are large. Cosine similarity: ignores size, measures angle between vectors.
Vectors aren't just numbers - they're points in space. Similar vectors cluster together, and this geometry naturally reveals patterns in data.
Each dimension represents one measurable trait. Our 2D customer space has 2 traits. Word embeddings have 768 traits. Same math, more detail.
We can't visualize 768D, but the mathematics is identical. Distance, angles, and clusters work exactly the same in any dimension.
The Foundation:
Vectors represent data as points in space. Dot product measures alignment and magnitude. Cosine similarity measures pure direction.
These simple operations power everything from search engines to language models. Master these fundamentals, and deep learning becomes clear.
Test what you've learned about vectors and deep learning!