From predicting numbers to predicting yes/no answers
In Chapters 1 and 2, we learned to predict continuous numbers: house prices like $400k, $250k, $500k. We used MSE (Mean Squared Error) as our loss function because we were measuring how far off our predictions were from actual numbers.
But what if instead we wanted to predict yes/no answers? Categories instead of numbers?
| | Predict a NUMBER (Regression) | Predict a CATEGORY (Classification) |
|---|---|---|
| Question | How much? | Which category? |
| Answer | A continuous number | A label (Yes/No, A/B/C) |
| Example | House price = $450k | Email = Spam or Not Spam |
| Loss function | MSE (Mean Squared Error) | Cross-Entropy |
Imagine we run a subscription service. We have data about our customers: how many months they've been with us and how often they use our service. We want to predict: Will they RENEW or CHURN (cancel)?
Each point represents a customer. Green = Renewed, Red = Churned
| Customer | Months Subscribed (x1) | Usage hrs/week (x2) | Outcome (y) |
|---|---|---|---|
| A | 10 | 35 | RENEW |
| B | 12 | 38 | RENEW |
| C | 14 | 32 | RENEW |
| D | 11 | 28 | RENEW |
| E | 2 | 8 | CHURN |
| F | 1 | 5 | CHURN |
| G | 3 | 12 | CHURN |
| H | 2.5 | 6 | CHURN |
Notice how customers who renew cluster in one region (longer subscription, higher usage) while those who churn cluster in another (shorter subscription, lower usage).
Just like in regression, the machine uses weights and bias. But instead of predicting a number, it's trying to draw a line (decision boundary) that separates Renew from Churn customers.
Many misclassifications! The line is in the wrong place, and several customers land on the wrong side.
Count how many customers are on the wrong side of the line. Each mistake increases the error.
The machine tweaks the weights and bias to rotate and shift the line, trying to reduce misclassifications. Here's how:
The gradient tells us which direction to move each weight to reduce errors. Think of it as a compass pointing toward "less wrong" predictions.
The learning rate controls how big of a step we take in that direction.
Example: Learning rate = 0.01 means we move 1% of the gradient's recommendation each step
new_weight = old_weight - (learning_rate × gradient)

We subtract because we want to go downhill (reduce error), and the gradient points in the direction of steepest increase.
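Here is roughly what that update rule looks like in code for our two-feature churn example. This is a minimal sketch, not the exact routine behind the demo: it assumes a logistic (sigmoid) output paired with cross-entropy loss so the gradient has a simple form, and it uses the customer data from the table above.

```python
import numpy as np

# Customer data from the table above: [months subscribed, usage hrs/week]
X = np.array([[10, 35], [12, 38], [14, 32], [11, 28],
              [2, 8], [1, 5], [3, 12], [2.5, 6]])
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])   # 1 = RENEW, 0 = CHURN

weights = np.zeros(2)    # w1, w2
bias = 0.0
learning_rate = 0.01     # move 1% of the gradient's recommendation each step

for step in range(1000):
    scores = X @ weights + bias           # decision score for every customer
    probs = 1 / (1 + np.exp(-scores))     # sigmoid squashes scores into 0..1 (an assumption here)
    grad_w = X.T @ (probs - y) / len(y)   # gradient of cross-entropy loss w.r.t. the weights
    grad_b = np.mean(probs - y)           # ...and w.r.t. the bias
    weights -= learning_rate * grad_w     # new_weight = old_weight - (learning_rate × gradient)
    bias -= learning_rate * grad_b

print(weights, bias)  # parameters of the learned decision boundary
```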
Line separates the clusters well! All 8 customers correctly classified.
If Decision (w₁×x₁ + w₂×x₂ + bias) > 0 → Predict RENEW
If Decision ≤ 0 → Predict CHURN
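In code, that prediction rule is just a weighted sum plus a threshold. A minimal sketch, with made-up weight and bias values purely for illustration:

```python
def predict(months, usage, w1, w2, bias):
    decision = w1 * months + w2 * usage + bias   # the decision score
    return "RENEW" if decision > 0 else "CHURN"

# Illustrative (made-up) parameters: long-tenured, high-usage customers score above 0
print(predict(12, 38, w1=0.3, w2=0.2, bias=-6.0))  # RENEW
print(predict(2, 8, w1=0.3, w2=0.2, bias=-6.0))    # CHURN
```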
Watch the decision boundary adjust itself to correctly classify all customers
The interesting thing is that multiple different lines can separate the data reasonably well. Depending on where gradient descent starts and how it progresses, the machine might find different solutions!
- Line A (steep slope): Accuracy 100% (8/8 correct)
- Line B (medium slope): Accuracy 100% (8/8 correct)
- Line C (gentle slope): Accuracy 100% (8/8 correct)
All three lines correctly classify our 8 training customers: they all achieve 100% accuracy on the training data! But each line makes slightly different predictions for new customers not in our dataset. A customer near the boundary might be classified as RENEW by one model but CHURN by another. This is why the exact boundary the model settles on still matters:
Classification is about finding a decision boundary that separates clusters. The machine learns by adjusting weights and bias through gradient descent, trying many iterations until it finds a line that minimizes errors. But there's no single "perfect" answer—just different trade-offs between different types of mistakes.
The decision boundary we've been calling a "line" has a formal mathematical name: a hyperplane. This term might sound intimidating, but it's actually quite simple once you see the pattern.
With 2 features (months, usage), the hyperplane is a line. This is what we've been working with!
With 3 features (add "support tickets"), the hyperplane becomes a flat plane cutting through 3D space.
With 768 features (word embeddings), the hyperplane is a 767-dimensional surface. Can't visualize it, but the math works identically!
Notice that all three decision boundaries follow the same formula we learned earlier:
w₁×x₁ + w₂×x₂ + w₃×x₃ + ... + wₙ×xₙ + bias = 0
In 2D: This equation defines a line
In 3D: This equation defines a plane
In nD: This equation defines a hyperplane
Same formula. Same gradient descent. Same learning process. Just more dimensions!
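In code, going from 2 features to 768 is just a longer dot product. A minimal sketch (the weights and high-dimensional inputs below are random placeholders, not learned values):

```python
import numpy as np

def classify(x, weights, bias):
    # w1*x1 + w2*x2 + ... + wn*xn + bias > 0 ?  The same rule in any dimension.
    return "RENEW" if np.dot(weights, x) + bias > 0 else "CHURN"

rng = np.random.default_rng(0)
print(classify(np.array([12, 38]), np.array([0.3, 0.2]), -6.0))   # 2 features: the boundary is a line
print(classify(rng.normal(size=768), rng.normal(size=768), 0.0))  # 768 features: a hyperplane
```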
We've covered the core building blocks for classification:
- Hyperplanes (decision boundaries) that separate categories (renew vs. churn)
- Cross-entropy loss, which measures prediction errors for categories (sketched in code below)
- Gradient descent, which adjusts weights and bias to minimize loss on the training data
With these pieces, we can build a classifier, train it on data, and watch it learn.
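Of those pieces, cross-entropy is the one we haven't written down yet. For a yes/no problem it averages -log(probability the model assigned to the correct label), so confident wrong answers are punished hard. A minimal sketch:

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob):
    # Average of -log(probability assigned to the correct label)
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

print(binary_cross_entropy([1, 0], [0.9, 0.1]))  # confident and correct -> ~0.11 (small loss)
print(binary_cross_entropy([1, 0], [0.1, 0.9]))  # confident and wrong  -> ~2.30 (large loss)
```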
We've trained a churn prediction model on customer data and it performs perfectly! But here's the real question: Will it work on new customers it has never seen before?
Imagine testing three churn prediction models. Each model trains on 100 past customers, then gets tested on 100 completely new customers. Here's what happens:
Overfitted! Memorized training data, can't generalize to new customers
Generalized! Learned churn patterns, not specific customers
Underfitted! Model too simple to learn churn patterns
Overfitting: The model memorizes the training data (100% accuracy) but fails on new data (25%).
Generalizing: The model learns the pattern and performs similarly on both (87% vs 84%).
Underfitting: The model is too simple and performs poorly on everything (~50%).
To know if a model truly works, we need to test it on data it has never seen during training.
| Split | Size | Purpose | Analogy |
|---|---|---|---|
| Training set | 70 customers | Used to learn weights via gradient descent | Studying with past exam questions |
| Validation set | 15 customers | Used to tune hyperparameters and prevent overfitting | Practice tests to check if we're ready |
| Test set | 15 customers | Used ONLY ONCE at the end to measure final performance | The actual exam we take once |
See how we split our 8 customers into different sets. Click shuffle to randomize the split!
The test set represents future unseen data that the model has never encountered.
If we peek at test performance during training and adjust our model, we're essentially "teaching to the test" - the model will memorize patterns specific to the test set instead of learning general patterns.
The test set gives us an honest answer: "How will this model perform in the real world?"
Rule: Touch the test set ONLY ONCE at the very end, after all training and tuning is complete.
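One common way to make this split in practice is scikit-learn's `train_test_split`, applied twice. A sketch with random stand-in data (the features and labels here are placeholders, not our real customers):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 100 customers, 2 features each, 0/1 renew/churn labels
rng = np.random.default_rng(42)
X = rng.random((100, 2))
y = rng.integers(0, 2, size=100)

# Carve off the test set first (15 customers) and lock it away until the very end
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=15, random_state=42)

# Split what's left into training (70 customers) and validation (15 customers)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=15, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 70 15 15
```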
Adjust model complexity and training epochs to see overfitting happen in real time
Model complexity refers to how flexible and powerful a model is. Think of it like drawing a line vs. drawing a wiggly curve:
How to identify: Check the number of parameters (weights) - more parameters = more complex
An epoch is one complete pass through all the training data. Training for multiple epochs means the model sees the same data multiple times and keeps learning from it.
How to identify: Monitor when training accuracy keeps improving but test accuracy stops improving or gets worse
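Here is what that monitoring loop can look like in code. This is only a sketch of the mechanics: it uses random stand-in data (so the numbers themselves won't trace a dramatic overfitting curve) and assumes a recent scikit-learn, where `SGDClassifier(loss="log_loss")` trains a logistic-regression-style model by gradient descent.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Random stand-in data; real customer features and labels would go here
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(70, 2)), rng.integers(0, 2, size=70)
X_val, y_val = rng.normal(size=(15, 2)), rng.integers(0, 2, size=15)

model = SGDClassifier(loss="log_loss", random_state=0)
for epoch in range(20):
    model.partial_fit(X_train, y_train, classes=[0, 1])  # one pass over the training data
    train_acc = model.score(X_train, y_train)
    val_acc = model.score(X_val, y_val)
    # Overfitting signature: train_acc keeps climbing while val_acc stalls or drops
    print(f"epoch {epoch:2d}  train={train_acc:.2f}  val={val_acc:.2f}")
```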
Moderate complexity with reasonable regularization. Should generalize well.
We've built a churn prediction model. Time to measure how well it works. Let's test it on 100 customers.
Our Model's Accuracy: 95%
Wow! That sounds amazing, right?
Here's the surprising part: a completely useless model could also get 95% accuracy!
The Dumbest Possible Model:
```python
def predict(customer):
    return "RENEW"  # Always predict RENEW for everyone
```
This model always predicts "RENEW" for every customer. In our group of 100 customers, 95 actually renew and only 5 churn, so let's see what happens:
Accuracy: 95 correct / 100 customers = 95%
The same 95% accuracy! But this model is completely useless—it never catches a single churner. We'd lose millions in revenue from customers we could have saved.
Accuracy only tells us "how many did we get right overall", but what we actually need to know is: did we catch the customers who were about to churn, and how often did we raise false alarms on happy customers?
We need a better way to see what's really happening.
Instead of one number (accuracy), let's look at four numbers that tell us exactly what's happening. The model makes predictions, and then reality happens. There are only 4 possible outcomes:
- True Positive: we predicted churn, and the customer really churned
- False Positive: we predicted churn, but the customer actually renewed (a false alarm)
- False Negative: we predicted renew, but the customer actually churned (a missed churner)
- True Negative: we predicted renew, and the customer really renewed
We organize these 4 outcomes into a simple 2×2 table. This makes it easy to see patterns:
Now we can see the full picture! The model got 93 predictions correct (90 + 3), but missed 2 out of 5 churners. That's the insight accuracy alone couldn't show us.
Accuracy: 93%
What it means: Out of 100 customers, we got the right prediction 93 times.
⚠️ Why it's misleading: We got 93% right overall, but we missed 2 out of 5 churners! We lost valuable customers we could have saved.
Precision: 37.5%
What it means: When we predict churn, there's only a 37.5% chance the customer will actually churn.
This matters when: Reaching out to customers is expensive or might annoy happy customers.
Recall: 60%
What it means: Out of the 5 customers who actually churned, we caught 3 and missed 2.
This matters when: Missing a churner means losing a valuable customer we could have saved.
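Putting numbers to these definitions, using the counts implied by the example above (the false-alarm count of 5 isn't stated directly; it's inferred from the 37.5% precision figure):

```python
# Treat CHURN as the "positive" class we're trying to catch
TP = 3    # predicted churn, customer actually churned
FN = 2    # predicted renew, customer actually churned (the churners we missed)
FP = 5    # predicted churn, customer actually renewed (false alarms)
TN = 90   # predicted renew, customer actually renewed

accuracy  = (TP + TN) / (TP + TN + FP + FN)   # 93 / 100 = 0.93
precision = TP / (TP + FP)                    # 3 / 8 = 0.375
recall    = TP / (TP + FN)                    # 3 / 5 = 0.60
print(accuracy, precision, recall)
```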
We're comparing two churn prediction models. Both cost the same. We can only choose one.
Model 1 (high recall, low precision). Result: We contact way too many happy customers. They get annoyed and... actually churn!
Model 2 (high precision, low recall). Result: Efficient, but we lose valuable customers we could have saved.
We're stuck. Each model is great at ONE thing but terrible at the other. We need one number that only gives high scores to balanced models.
What it means: One number that only gives high scores to models that do BOTH jobs well—catching churners AND avoiding false alarms.
This matters when: We need to compare models fairly. A model that predicts everyone will churn gets a terrible F1 score (too many false alarms). So does one that never predicts churn (misses everyone). F1 punishes one-sided models.
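Concretely, F1 is the harmonic mean of precision and recall, which collapses toward whichever of the two is smaller. Continuing the counts from the sketch above:

```python
precision, recall = 0.375, 0.60
f1 = 2 * precision * recall / (precision + recall)
print(f1)  # ≈ 0.46: mediocre, because the low precision drags the score down
```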
Different situations call for different priorities: sometimes missing a churner (low recall) is the costlier mistake, and sometimes annoying happy customers with false alarms (low precision) costs more.
The key insight: Accuracy alone isn't enough. Understanding what types of errors matter most for the specific problem is crucial.
Adjust the values to see how different prediction patterns affect accuracy, precision, recall, and F1-score
We now understand overfitting and how to measure model performance. But we haven't covered:
- Techniques to prevent overfitting: L1 and L2 regularization, dropout. Question: How do we force models to generalize instead of memorize?
- Faster, smarter training: the Adam optimizer, learning rate schedules, batch vs. mini-batch training. Question: Why does ChatGPT train in weeks, not years?
Core concepts from this chapter:
Regression predicts continuous numbers (house price: $450k). Classification predicts discrete categories (email: spam or not spam). Use classification when the answer is a label, not a number.
Classification models learn a decision boundary (like a line or curve) that separates different categories in the feature space. Points on one side get classified one way, points on the other side get classified differently.
Different decision boundaries can separate the same data reasonably well. Gradient descent might find different solutions depending on where it starts. This is normal in machine learning.
When classes are imbalanced (95 renewed, 5 churned), even a useless model can achieve high accuracy. The Confusion Matrix reveals what's really happening by showing True Positives, False Positives, True Negatives, and False Negatives.
With 2 features, the boundary is a line. With 3 features, it's a plane. With more features, it's a hyperplane. The math extends naturally to any number of dimensions.
Models can perform well on training data but poorly on new data (overfitting). Always test on data the model hasn't seen to measure real-world performance.
Key Decision:
Predicting numbers? Use Regression. Predicting categories? Use Classification.
Both use the same core concepts: weights, bias, gradient descent, and loss functions. Only the output type and loss function change.
Test what you've learned about classification!