From predicting numbers to predicting yes/no answers
In Chapters 1 and 2, we learned to predict continuous numbers: house prices like $400k, $250k, $500k. We used MSE (Mean Squared Error) as our loss function because we were measuring how far off our predictions were from actual numbers.
But what if instead we wanted to predict yes/no answers? Categories instead of numbers?
| | Predict a NUMBER | Predict a CATEGORY |
|---|---|---|
| Question | How much? | Which category? |
| Answer | A continuous number | A label (Yes/No, A/B/C) |
| Example | House price = $450k | Email = Spam or Not Spam |
| Loss Function | MSE (Mean Squared Error) | Cross-Entropy |
Imagine we run a subscription service. We have data about our customers: how many months they've been with us and how often they use our service. We want to predict: Will they RENEW or CHURN (cancel)?
Each point represents a customer. Green = Renewed, Red = Churned
| Customer | Months Subscribed (x1) | Usage hrs/week (x2) | Outcome (y) |
|---|---|---|---|
| A | 10 | 35 | RENEW |
| B | 12 | 38 | RENEW |
| C | 14 | 32 | RENEW |
| D | 11 | 28 | RENEW |
| E | 2 | 8 | CHURN |
| F | 1 | 5 | CHURN |
| G | 3 | 12 | CHURN |
| H | 2.5 | 6 | CHURN |
Notice how customers who renew cluster in one region (longer subscription, higher usage) while those who churn cluster in another (shorter subscription, lower usage).
Just like in regression, the machine uses weights and bias. But instead of predicting a number, it's trying to draw a line (decision boundary) that separates Renew from Churn customers.
Many misclassifications! The line is in the wrong place, and several customers end up on the wrong side.
Count how many customers are on the wrong side of the line. Each mistake increases the error.
The machine tweaks the weights and bias to rotate and shift the line, trying to reduce misclassifications. Here's how:
The gradient tells us which direction to move each weight to reduce errors. Think of it as a compass pointing toward "less wrong" predictions.
The learning rate controls how big of a step we take in that direction.
Example: Learning rate = 0.01 means we move 1% of the gradient's recommendation each step
new_weight = old_weight - (learning_rate × gradient)

We subtract because we want to go downhill (reduce error), and the gradient points in the direction of steepest increase.
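Here's a minimal sketch of that update in code. The weight and gradient values are illustrative numbers only, not ones computed from our customer data:

```python
learning_rate = 0.01  # how big a step we take

# Illustrative values only -- in real training the gradient comes from the loss
old_weight = 0.50
gradient = 2.0        # positive gradient: increasing this weight would increase the error

# Step *against* the gradient to reduce the error
new_weight = old_weight - learning_rate * gradient
print(new_weight)     # 0.48 -- the weight moved slightly "downhill"
```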
Line separates the clusters well! All 8 customers correctly classified.
If Decision > 0 → Predict RENEW
If Decision ≤ 0 → Predict CHURN
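Putting the decision rule into code looks like this. The weights and bias below are hypothetical values chosen for illustration, not parameters actually learned from our data:

```python
def predict(months_subscribed, usage_hours, w1, w2, bias):
    # Decision score: which side of the line is this customer on?
    decision = w1 * months_subscribed + w2 * usage_hours + bias
    return "RENEW" if decision > 0 else "CHURN"

# Hypothetical learned parameters
w1, w2, bias = 1.0, 1.0, -25.0

print(predict(12, 38, w1, w2, bias))  # RENEW (decision = 25 > 0)
print(predict(2, 8, w1, w2, bias))    # CHURN (decision = -15 <= 0)
```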
Watch the decision boundary adjust itself to correctly classify all customers
The interesting thing is that multiple different lines can separate the data reasonably well. Depending on where gradient descent starts and how it progresses, the machine might find different solutions!
Line A: Steep slope
Accuracy: 100% (8/8 correct)
Line B: Medium slope
Accuracy: 100% (8/8 correct)
Line C: Gentle slope
Accuracy: 100% (8/8 correct)
All three lines correctly classify our 8 training customers - they all achieve 100% accuracy on the training data! But each line makes slightly different predictions for new customers not in our dataset. A customer near the boundary might be classified as RENEW by one model but CHURN by another. This leads to a key insight:
Classification is about finding a decision boundary that separates clusters. The machine learns by adjusting weights and bias through gradient descent, trying many iterations until it finds a line that minimizes errors. But there's no single "perfect" answer—just different trade-offs between different types of mistakes.
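To make this concrete, here's a small sketch that checks three hand-picked boundaries against our 8 customers. The weights are hypothetical (chosen by hand, not found by gradient descent), but all three reach 8/8:

```python
# (months, usage, actual outcome) for customers A-H
customers = [
    (10, 35, "RENEW"), (12, 38, "RENEW"), (14, 32, "RENEW"), (11, 28, "RENEW"),
    (2, 8, "CHURN"), (1, 5, "CHURN"), (3, 12, "CHURN"), (2.5, 6, "CHURN"),
]

# Three different (w1, w2, bias) boundaries -- steep, medium, gentle
boundaries = {
    "Line A (steep)":  (5.0, 0.1, -30.0),
    "Line B (medium)": (1.0, 1.0, -25.0),
    "Line C (gentle)": (0.1, 1.0, -20.0),
}

for name, (w1, w2, bias) in boundaries.items():
    correct = sum(
        ("RENEW" if w1 * x1 + w2 * x2 + bias > 0 else "CHURN") == y
        for x1, x2, y in customers
    )
    print(f"{name}: {correct}/8 correct")  # each prints 8/8
```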
The decision boundary we've been calling a "line" has a formal mathematical name: a hyperplane. This term might sound intimidating, but it's actually quite simple once you see the pattern.
With 2 features (months, usage), the hyperplane is a line. This is what we've been working with!
With 3 features (add "support tickets"), the hyperplane becomes a flat plane cutting through 3D space.
With 768 features (word embeddings), the hyperplane is a 767-dimensional surface. Can't visualize it, but the math works identically!
Notice that all three decision boundaries follow the same formula we learned earlier:
w₁×x₁ + w₂×x₂ + w₃×x₃ + ... + wₙ×xₙ + bias = 0
In 2D: This equation defines a line
In 3D: This equation defines a plane
In nD: This equation defines a hyperplane
Same formula. Same gradient descent. Same learning process. Just more dimensions!
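In code, this formula is usually written as a dot product, so nothing changes as the number of features grows. A minimal NumPy sketch with made-up numbers:

```python
import numpy as np

def decision_score(weights, features, bias):
    # w1*x1 + w2*x2 + ... + wn*xn + bias, for any number of features
    return np.dot(weights, features) + bias

# 2 features: months, usage -> the line we've been drawing
print(decision_score(np.array([1.0, 1.0]), np.array([12.0, 38.0]), -25.0))

# 768 features: exactly the same call, just longer vectors
w = np.random.randn(768)
x = np.random.randn(768)
print(decision_score(w, x, -0.5))
```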
This section covers an elegant mathematical technique used in ML libraries. It's not essential for understanding classification, but it's interesting if you're curious about implementation details!
You might have noticed something slightly awkward about our formula:
w₁×x₁ + w₂×x₂ + w₃×x₃ + ... + wₙ×xₙ + bias = 0

The bias term looks a bit tacked on—all the other terms multiply a weight by a feature, but the bias just... sits there at the end. There's an elegant way to fix this!
Instead of keeping bias separate, we can absorb it into the weight vector by introducing a "dummy" feature x₀ that is always equal to 1.
Let: x₀ = 1 (always)
Let: w₀ = bias
Now our equation becomes perfectly uniform:
w₀×x₀ + w₁×x₁ + w₂×x₂ + w₃×x₃ + ... + wₙ×xₙ = 0

Since x₀ = 1, the term w₀×x₀ = w₀×1 = bias, so we haven't changed the math—we've just repackaged it more elegantly.
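A quick sketch showing that the repackaging gives exactly the same number (the weights here are illustrative):

```python
import numpy as np

w = np.array([1.0, 1.0])      # w1, w2
x = np.array([12.0, 38.0])    # months, usage
bias = -25.0

# Original form: weights dot features, plus a separate bias
original = np.dot(w, x) + bias

# Augmented form: prepend x0 = 1 and fold the bias in as w0
w_aug = np.array([bias, 1.0, 1.0])   # [w0, w1, w2]
x_aug = np.array([1.0, 12.0, 38.0])  # [x0, x1, x2]
augmented = np.dot(w_aug, x_aug)

print(original, augmented)  # 25.0 25.0 -- identical
```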
This technique (called "augmented feature space" or "homogeneous coordinates") provides several benefits: every term now has the same weight-times-feature shape, the bias is updated by the same gradient descent rule as every other weight, and the whole expression can be written as a single dot product.
Note: You might see this written compactly as w·x or wᵀx in machine learning papers—we'll explore what these notations mean when we dive into vectors and matrices in upcoming chapters.
💡 Practical Note: In practice, many machine learning libraries handle this automatically behind the scenes. When you specify a model with bias, the library is likely using this augmented feature representation internally. Now you know the trick they're using!
We've covered the core building blocks for classification:
A hyperplane (decision boundary) that separates categories (renew vs churn)
Cross-entropy, the loss function that measures prediction errors for categories (a small sketch follows below)
Gradient descent, which adjusts weights and bias to minimize loss on training data
With these pieces, we can build a classifier, train it on data, and watch it learn.
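We haven't worked through cross-entropy in detail yet, but here's a minimal sketch of the binary version, assuming the model outputs a probability of churn rather than a hard yes/no. The key behavior: confident wrong predictions are punished much harder than hesitant ones.

```python
import math

def binary_cross_entropy(actual, predicted_prob):
    # actual: 1 if the customer churned, 0 if they renewed
    # predicted_prob: the model's predicted probability of churn
    return -(actual * math.log(predicted_prob) + (1 - actual) * math.log(1 - predicted_prob))

print(binary_cross_entropy(1, 0.9))   # ~0.105 -- confident and right: small loss
print(binary_cross_entropy(1, 0.5))   # ~0.693 -- unsure: medium loss
print(binary_cross_entropy(1, 0.1))   # ~2.303 -- confident and wrong: large loss
```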
We just saw how a model can memorize training data (95% accuracy) but fail on new data (62% accuracy).
But here's the problem: You can't wait until April—when you deploy the model on real customers—to discover it doesn't work!
We need to know if the model generalizes BEFORE we deploy it.
How? By pretending some of our historical data is "new" and testing the model on it!
Instead of training on ALL 500 customers from January-March, we deliberately hold some back:
The model learns patterns from these customers
The model has NEVER seen these during training. We test on them to see if it can generalize.
The key insight: If the model performs well on the test set (which it never saw during training), we have evidence it learned patterns, not memorization. If it performs poorly, we know it overfitted BEFORE deploying to production!
Let's see this in action. Here are three different models trained on the same data. Watch how their performance on training vs test data reveals everything:
Each model trains on 400 customers, then we test all three on the held-out 100 customers they've never seen:
The Smoking Gun: Huge gap between training (98%) and test (58%). This model memorized specific customer details instead of learning general churn patterns.
Perfect! Similar performance on both sets (87% vs 84%). Small gap means it learned patterns that generalize. This is what we want!
Underfitted: Consistent but poor performance (63% vs 62%). Model too simple to capture churn patterns. Need more complexity!
To know if a model truly works, we need to test it on data it has never seen during training.
So we split the data into three sets:

| Set | Size | Purpose | Analogy |
|---|---|---|---|
| Training set | 70 customers | Used to learn weights via gradient descent | Studying with past exam questions |
| Validation set | 15 customers | Used to tune hyperparameters and prevent overfitting | Practice tests to check if we're ready |
| Test set | 15 customers | Used ONLY ONCE at the end to measure final performance | The actual exam we take once |
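Here's a minimal sketch of that split, assuming the customers live in a plain Python list (the 100-customer total and the 70/15/15 sizes follow the breakdown above):

```python
import random

customers = list(range(100))  # stand-ins for 100 customer records
random.shuffle(customers)     # shuffle so the split isn't biased by ordering

train_set = customers[:70]    # learn weights here
val_set   = customers[70:85]  # tune hyperparameters here
test_set  = customers[85:]    # touch ONLY ONCE, at the very end

print(len(train_set), len(val_set), len(test_set))  # 70 15 15
```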
See how we split our 8 customers into different sets. Click shuffle to randomize the split!
The test set represents future unseen data that the model has never encountered.
If we peek at test performance during training and adjust our model, we're essentially "teaching to the test" - the model will memorize patterns specific to the test set instead of learning general patterns.
The test set gives us an honest answer: "How will this model perform in the real world?"
Rule: Touch the test set ONLY ONCE at the very end, after all training and tuning is complete.
Adjust model complexity and training epochs to see overfitting happen in real-time
Model complexity refers to how flexible and powerful a model is. Think of it like drawing a straight line vs. a wiggly curve: a simple model can only draw the straight line, while a complex model can bend into a wiggly curve that traces every individual training point.
How to identify: Check the number of parameters (weights) - more parameters = more complex
An epoch is one complete pass through all the training data. Training for multiple epochs means the model sees the same data multiple times and keeps learning from it.
How to identify: Monitor when training accuracy keeps improving but test accuracy stops improving or gets worse
Moderate complexity with reasonable regularization. Should generalize well.
How to detect overfitting: Check if there's a large gap between training accuracy and test accuracy
What good generalization looks like: Training 87%, Test 84% → Small gap means the model learned real patterns
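A small sketch of that check. The 10-percentage-point gap and the 70% floor are illustrative thresholds chosen for this example, not universal rules:

```python
def diagnose(train_acc, test_acc, gap_threshold=0.10, low_threshold=0.70):
    gap = train_acc - test_acc
    if gap > gap_threshold:
        return f"Overfitting: large train/test gap ({gap:.0%})"
    if train_acc < low_threshold and test_acc < low_threshold:
        return "Underfitting: poor performance on both sets"
    return f"Good generalization: small gap ({gap:.0%})"

print(diagnose(0.98, 0.58))  # Overfitting: large train/test gap (40%)
print(diagnose(0.63, 0.62))  # Underfitting: poor performance on both sets
print(diagnose(0.87, 0.84))  # Good generalization: small gap (3%)
```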
Let's say you've built a model that shows good generalization:
Great! No overfitting. But now you need to actually use this model in production. Before deployment, leadership will want to know:
"89% accuracy sounds good, but what does that actually mean for our business?"
Knowing your model generalizes (doesn't overfit) is essential.
But to deploy it in production, you need to know what it's actually good at.
Let's look at a model that passed our overfitting test (small train/test gap). Now let's dig deeper into what kinds of predictions it's making.
Our Model's Accuracy: 95%
Wow! That sounds amazing, right?
Here's the surprising part: a completely useless model could also get 95% accuracy!
The Dumbest Possible Model:
```python
def predict(customer):
    return "RENEW"  # Always predict RENEW for everyone
```
This model always predicts "RENEW" for every customer. Let's see what happens:
Accuracy: 95 / 100 = 95%
The same 95% accuracy! But this model is completely useless—it never catches a single churner. We'd lose millions in revenue from customers we could have saved.
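You can verify this yourself in a few lines, assuming the 95 renew / 5 churn split described here:

```python
actual = ["RENEW"] * 95 + ["CHURN"] * 5   # 95 renewed, 5 churned
predicted = ["RENEW"] * 100               # the "dumbest possible model"

correct = sum(a == p for a, p in zip(actual, predicted))
print(f"Accuracy: {correct}/{len(actual)} = {correct / len(actual):.0%}")  # 95/100 = 95%

churners_caught = sum(a == "CHURN" and p == "CHURN" for a, p in zip(actual, predicted))
print(f"Churners caught: {churners_caught} out of 5")  # 0 -- completely useless
```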
Accuracy only tells us "how many did we get right overall", but what we actually need to know is: how many churners did we actually catch, and how many false alarms did we raise?
We need a better way to see what's really happening.
A confusion matrix is simply a table that compares what the model predicted versus what actually happened.
Think of it this way: You have 100 customers. For each one, two things happen:
"I think this customer will CHURN"
or
"I think this customer will RENEW"
Customer actually CHURNED
or
Customer actually RENEWED
The confusion matrix organizes all 100 customers into a 2×2 grid based on these two questions:
| | Model Predicted: CHURN | Model Predicted: RENEW |
|---|---|---|
| Actually CHURNED | Correct! (True Positive) | Wrong (False Negative) |
| Actually RENEWED | Wrong (False Positive) | Correct! (True Negative) |
That's it! The confusion matrix doesn't show complex math—it just counts how many customers fall into each of these four boxes. Once we have these counts, we can calculate metrics like Precision and Recall that tell us exactly what the model is good at.
Now let's see what this looks like with actual numbers. Instead of one number (accuracy), we'll look at four numbers that tell us exactly what's happening:
We organize these 4 outcomes into a simple 2×2 table. This makes it easy to see patterns:
Now we can see the full picture! The model got 93 predictions correct (90 + 3), but missed 2 out of 5 churners. That's the insight accuracy alone couldn't show us.
What it means: Out of 100 customers, we got the right prediction 93 times.
⚠️ Why it's misleading: We got 93% right overall, but missed 2 out of 5 churners! We lost valuable customers we could have saved.
What it means: When we predict churn, there's only a 37.5% chance they'll actually churn.
This matters when: Reaching out to customers is expensive or might annoy happy customers
What it means: Out of 5 customers who actually churned, we predicted 3 and missed 2.
This matters when: Missing a churner means losing a valuable customer we could have saved
We're comparing two churn prediction models. Both cost the same. We can only choose one.
Result: We contact way too many happy customers. They get annoyed and... actually churn!
Result: Efficient but we lose valuable customers we could have saved.
We're stuck. Each model is great at ONE thing but terrible at the other. We need one number that only gives high scores to balanced models.
What it means: One number that only gives high scores to models that do BOTH jobs well—catching churners AND avoiding false alarms.
This matters when: We need to compare models fairly. A model that predicts everyone will churn gets a terrible F1 score (too many false alarms). So does one that never predicts churn (misses everyone). F1 punishes one-sided models.
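Here's a minimal sketch that computes all four metrics from the confusion matrix counts we saw above (TP = 3, FN = 2, FP = 5, TN = 90):

```python
TP, FN, FP, TN = 3, 2, 5, 90  # counts from the confusion matrix above

accuracy  = (TP + TN) / (TP + TN + FP + FN)   # 0.93  -- looks great...
precision = TP / (TP + FP)                    # 0.375 -- but only 37.5% of churn alerts are real
recall    = TP / (TP + FN)                    # 0.60  -- and we only catch 3 of 5 churners
f1        = 2 * precision * recall / (precision + recall)  # ~0.46 -- the balanced view

print(f"Accuracy:  {accuracy:.1%}")
print(f"Precision: {precision:.1%}")
print(f"Recall:    {recall:.1%}")
print(f"F1 score:  {f1:.1%}")
```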
Different situations call for different priorities. The right metric depends on the business impact of each type of error:
The key insight: Accuracy alone isn't enough. Understanding what types of errors matter most for your specific problem is crucial. There's no "best" metric—only the right metric for your business priorities.
The concepts from this chapter give you the tools to make informed decisions when evaluating, testing, and deploying AI systems. Here's how:
An AI vendor shows you their customer service agent achieving "92% accuracy" in their demo.
Accuracy alone is misleading with imbalanced data. You need to see the confusion matrix.
Ask specific questions:
You're applying the same critical thinking from this chapter: Don't accept summary metrics. Understand what the model is actually good at.
You're about to deploy an AI agent that worked perfectly in the vendor's test environment.
Models can overfit to their training data. You must test on NEW data to detect this.
Test on YOUR data:
This is exactly the train/test split concept—vendor's demo is "training data," your conversations are "test data."
Your VP wants to know: "Is this AI agent worth the investment?"
Different metrics matter for different business priorities. Precision vs Recall represents a real trade-off.
Translate metrics to business impact:
You're using the precision/recall framework to frame business trade-offs leadership can understand.
Adjust the values to see how different prediction patterns affect accuracy, precision, recall, and F1-score
Linear classification works when categories can be separated by straight lines. But what happens when patterns are more complex—arranged in circles, nested regions, or intertwined in ways a single line can't capture?
And we've learned to detect overfitting, but how do we actively prevent it during training?
These questions lead us to neural networks, activation functions, and regularization techniques—the building blocks of modern deep learning. We'll explore these in upcoming chapters.
Core concepts from this chapter:
Regression predicts continuous numbers (house price: $450k). Classification predicts discrete categories (email: spam or not spam). Use classification when the answer is a label, not a number.
Classification models learn a decision boundary (like a line or curve) that separates different categories in the feature space. Points on one side get classified one way, points on the other side get classified differently.
Different decision boundaries can separate the same data reasonably well. Gradient descent might find different solutions depending on where it starts. This is normal in machine learning.
When classes are imbalanced (95 renewed, 5 churned), even a useless model can achieve high accuracy. The Confusion Matrix reveals what's really happening by showing True Positives, False Positives, True Negatives, and False Negatives.
With 2 features, the boundary is a line. With 3 features, it's a plane. With more features, it's a hyperplane. The math extends naturally to any number of dimensions.
Models can memorize training data (like memorizing "Customer #10234 will churn") instead of learning patterns (like "customers with low usage and high support tickets tend to churn"). This is called overfitting.
Split your data into training and test sets. A large gap between training accuracy (98%) and test accuracy (58%) means overfitting. A small gap (87% vs 84%) means good generalization.
Accuracy isn't enough for imbalanced data. Use Recall when missing positives is costly (churn, cancer screening). Use Precision when false alarms are expensive (fraud detection). Use F1 when you need balance (spam filters).
Key Decision:
Predicting numbers? Use Regression. Predicting categories? Use Classification.
Both use the same core concepts: weights, bias, gradient descent, and loss functions. Only the output type and loss function change.
Test what you've learned about classification!