Why machine learning predictions are probabilities, not guarantees
Imagine you run a coffee shop. Every morning, you need to decide: How many blueberry muffins should I bake?
For Thursday, should you bake exactly 20 muffins?
You can't know the exact number. Thursday could bring 15 customers or 25. Weather changes, events happen, people get sick.
You can't predict the future with certainty. But you CAN estimate probabilities.
Instead of saying "I'll sell exactly 20 muffins," you think in likelihoods: "I'll most likely sell around 20, probably somewhere between 15 and 25."
This is probability: quantifying uncertainty with numbers.
In Chapter 2, we saw the spam classifier output "0.95" for an email. What does that mean?
"This email is DEFINITELY spam"
"I'm 95% confident this is spam"
(There's still a 5% chance it's legitimate)
Every ML prediction is a probability statement about uncertainty.
Even when your model says "99% spam," it's saying "Based on patterns I learned, I believe there's a 99% chance this is spam." It's not omniscient—it's making an educated guess using data.
Probability is a number between 0 and 1 that represents how likely an event is to happen.
You can't have -0.3 probability or 1.5 probability. Only values from 0 to 1 make sense.
Example: When you flip a coin, either heads (H) or tails (T) must happen: P(H) + P(T) = 1. For a fair coin, P(H) = P(T) = 0.5.
Example: If the probability of rain is 0.3, then the probability of no rain is: 1 − 0.3 = 0.7.
Let's see probability in action. Flip a coin many times and watch the proportion of heads approach 0.5.
After 10 flips: proportion might be 0.3 or 0.7 (far from 0.5)
After 100 flips: proportion gets closer to 0.5
After 1000 flips: proportion is very close to 0.5
This is called the Law of Large Numbers: As you collect more data, observed frequencies approach true probabilities.
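Here's a minimal simulation of this idea (a sketch; your exact proportions will vary from run to run):

```python
import random

# Flip a fair coin n times and track the proportion of heads.
# As n grows, the proportion converges toward the true probability, 0.5.
random.seed(42)

for n in [10, 100, 1000, 100_000]:
    heads = sum(random.random() < 0.5 for _ in range(n))
    print(f"After {n:>7,} flips: proportion of heads = {heads / n:.3f}")
```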
Imagine you're on a game show. There are 3 doors. Behind one door is a car (the prize you want). Behind the other two doors are goats (you don't want goats).
Suppose you pick Door 1. The host, who knows where the car is, opens Door 3 and reveals a goat. Now the board looks like this:
- Door 1: your original choice
- Door 2: still closed
- Door 3: OPENED, goat revealed
The host asks: "Do you want to stay with Door 1, or switch to Door 2?"
"Well, there are now 2 doors left. One has a car, one has a goat. So it's 50/50, right? It doesn't matter if I stay or switch!"
This seems totally logical. After all, there are only 2 options remaining.
But here's the surprising truth: if you SWITCH to Door 2, you have a 2/3 (66.7%) chance of winning the car!
If you STAY with Door 1, you only have a 1/3 (33.3%) chance of winning!
Wait... what?! How is it NOT 50/50?? 🤯
I know what you're thinking: "Two doors, so 50/50." That was my first thought too.
But something changed when Monty opened Door 3. That action gave us new information—and the probabilities shifted.
To understand why, we need to learn about conditional probability—the most important concept in this entire chapter.
Conditional probability is just: "What's the probability of something happening, NOW THAT I know something else?"
Question: What's the probability it will rain today?
P(rain)
Just based on weather forecast for your city
You look outside and see dark clouds!
Question: What's the probability it will rain today, GIVEN THAT the sky has dark clouds?
P(rain | dark clouds)
This is conditional probability! The vertical bar "|" means "given that" or "knowing that". Read aloud: "the probability it will rain, given that I know there are dark clouds."
Key Insight: The "|" separates what you're calculating (left side) from what you already know (right side)
- Hiring: without information, P(get hired) = 5% (only 5% of applicants get hired). Knowing you have 10 years of experience: P(hired | 10 yrs exp) = 40%. New information changes the probability!
- Traffic: without information, P(traffic jam) = 20% on a typical day. Knowing it's rush hour: P(traffic | rush hour) = 75%. Much higher during rush hour!
- Dice: without information, P(roll a 6) = 1/6 (any number is equally likely). Knowing you rolled an even number: P(6 | even) = 1/3 (only 2, 4, or 6 are possible now!).
Conditional probability is how probabilities update when you learn new information. The new information changes what's possible, so the probabilities change too!
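You can check the dice example yourself with a quick simulation (a sketch; the estimates will wobble around 1/6 and 1/3):

```python
import random

# Estimate P(roll a 6) and P(6 | even) by simulation.
random.seed(0)
rolls = [random.randint(1, 6) for _ in range(100_000)]

p_six = sum(r == 6 for r in rolls) / len(rolls)

evens = [r for r in rolls if r % 2 == 0]  # condition on "rolled an even number"
p_six_given_even = sum(r == 6 for r in evens) / len(evens)

print(f"P(6)        ≈ {p_six:.3f}")             # ≈ 0.167 = 1/6
print(f"P(6 | even) ≈ {p_six_given_even:.3f}")  # ≈ 0.333 = 1/3
```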
Now we can understand why switching works in the Monty Hall problem!
You pick Door 1. The car is equally likely behind any door: P(Door 1) = P(Door 2) = P(Door 3) = 1/3.
Now we use conditional probability. Monty, who will never reveal the car, opens Door 3 and shows a goat:
- P(car at Door 1 | Monty opened Door 3) = 1/3: your original choice doesn't change
- P(car at Door 2 | Monty opened Door 3) = 2/3: Door 2 gets all the probability from Door 3!
Monty's action (opening a door) gives you new information. The conditional probability P(car at Door 2 | Monty opened Door 3) = 2/3 is higher than your original choice (1/3), so switching doubles your chances!
Conditional probability is the foundation. But there are two completely different ways to think about probability itself. Understanding both will reveal why machine learning works the way it does...
There are TWO different philosophies for what "probability" means. The famous Monty Hall problem shows this difference beautifully.
You're on a game show. There are 3 doors. Behind one is a car 🚗, behind the other two are goats 🐐.
Surprisingly: You should ALWAYS switch! But why? Let's see how Frequentist and Bayesian thinkers arrive at this answer...
A Frequentist says: "I don't speculate. Let me play this game many times and see what happens."
"After 1000 games, switching wins ~67% of the time. Therefore, switch!"
The Frequentist discovers the answer through experimentation and observation of long-run frequencies.
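Here's a sketch of that frequentist experiment, playing the game many times in code:

```python
import random

def play(switch: bool) -> bool:
    """Play one round of Monty Hall; return True if you win the car."""
    doors = [0, 1, 2]
    car = random.choice(doors)
    pick = random.choice(doors)
    # Monty opens a door that is neither your pick nor the car.
    monty = random.choice([d for d in doors if d != pick and d != car])
    if switch:
        pick = next(d for d in doors if d != pick and d != monty)
    return pick == car

random.seed(1)
N = 100_000
print("Stay:  ", sum(play(False) for _ in range(N)) / N)  # ≈ 0.333
print("Switch:", sum(play(True) for _ in range(N)) / N)   # ≈ 0.667
```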
A Bayesian says: "I start with initial beliefs, then UPDATE them as I get new information."
You pick Door 1. Each door is equally likely to have the car: P = 1/3 each.
NEW INFORMATION: Host revealed a goat behind Door 3
Key insight: The host knows where the car is and will NEVER open the car door!
When you picked Door 1, there was a 2/3 chance the car was behind Door 2 OR Door 3. The host just eliminated Door 3, so all that probability flows to Door 2!
"I'm 67% confident the car is behind Door 2. Switch!"
The Bayesian deduces the answer through logical reasoning and belief updating from a single game.
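If you want the belief update as an explicit calculation, here it is in this chapter's notation. You picked Door 1, and Monty opened Door 3:

P(opens 3 | car at 1) = 1/2 (Monty could have opened Door 2 or Door 3)
P(opens 3 | car at 2) = 1 (Monty must avoid the car, so Door 3 is his only option)
P(opens 3 | car at 3) = 0 (Monty never reveals the car)

P(car at 2 | opens 3) = (1/3 × 1) / (1/3 × 1/2 + 1/3 × 1 + 1/3 × 0) = (1/3) / (1/2) = 2/3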
Still not convinced switching helps? Let's scale up the problem to make the logic crystal clear.
Probability you picked correctly: 1/100 = 1%
Probability the car is behind one of the other 99 doors: 99/100 = 99%
The host, who knows where the car is, now opens 98 of the other 99 doors, revealing a goat behind every single one. Only your Door #1 and Door #47 remain closed. He deliberately left Door #47 closed.
- Door #1 (1%): your initial random guess
- Door #47 (99%): it inherits all the other 99 doors' probability!
OBVIOUSLY you should switch!
With 100 doors, it's intuitive that your first pick was almost certainly wrong (only a 1% chance of being right), so the one door the host deliberately avoided opening almost certainly hides the car (99%).
The same logic applies to 3 doors, but with 100 doors, our intuition finally catches up!
| Aspect | Frequentist | Bayesian |
|---|---|---|
| Probability means... | Long-run frequency in repeated trials | Degree of belief given available information |
| How they solve Monty Hall... | Simulate 1000 games, observe switching wins 67% | Update beliefs from 33%/33%/33% to 33%/67%/0% |
| Answer comes from... | Empirical observation | Logical deduction |
| Best for... | Repeatable experiments, quality control, A/B testing | One-time decisions, incorporating prior knowledge, sequential learning |
| In Machine Learning... | Training models on large datasets | Online learning, spam filters, recommendation systems |
So which philosophy is right? Both! They complement each other:
Modern AI systems use both approaches. Neural networks are trained with frequentist methods but make predictions that are interpreted as Bayesian probabilities!
Cognitive psychology research has discovered something fascinating: we naturally tend to think like frequentists. When making decisions, we often ignore prior probabilities (base rates) and focus only on the immediate evidence in front of us.
When we make decisions, we often commit what psychologists call the "base rate fallacy" — ignoring general probability information (priors) in favor of specific case information.
The Facts:
- 85% of the cabs in the city are Green; 15% are Blue.
- A cab was involved in a hit-and-run at night.
- A witness identified the cab as Blue.
- The witness correctly identifies cab colors 80% of the time.
Question: What's the probability it was actually a Blue cab?
It's natural to focus on the witness reliability (80%) and overlook the base rate (only 15% of cabs are Blue). This is frequentist thinking — trusting only the observed data.
Using Bayes' Theorem and considering BOTH the witness reliability AND the base rate:
Scenario 1: Cab is Blue (15% base rate) → Witness says Blue (80% reliable) = 0.15 × 0.80 = 0.12
Scenario 2: Cab is Green (85% base rate) → Witness says Blue (20% wrong) = 0.85 × 0.20 = 0.17
P(Blue | witness says Blue) = 0.12 / (0.12 + 0.17) ≈ 41%
The base rate matters! Most cabs are Green, so even with witness testimony, there's still a good chance it was a misidentified Green cab.
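If you want to check the arithmetic, here's a sketch with a small reusable Bayes'-rule helper:

```python
def posterior(prior, p_evidence_if_true, p_evidence_if_false):
    """P(hypothesis | evidence) via Bayes' theorem."""
    numerator = prior * p_evidence_if_true
    return numerator / (numerator + (1 - prior) * p_evidence_if_false)

# Hypothesis: the cab was Blue. Evidence: the witness says "Blue".
print(posterior(prior=0.15, p_evidence_if_true=0.80, p_evidence_if_false=0.20))
# ≈ 0.414 -- only about a 41% chance the cab was actually Blue
```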
Here's a medical scenario that shows the same pattern:
The Setup:
- A disease affects 0.1% of the population (1 in 1,000 people).
- A test for it is 99% accurate.
- You test positive.
"The test is 99% accurate, so I probably have it."
Error: Ignoring the 0.1% base rate!
"Disease is rare (0.1%). Even with a positive test, I only have ~9% chance."
Considers BOTH the test accuracy AND the base rate.
Our brains are wired to focus on concrete, immediate evidence (the test result, the witness) rather than abstract statistical information (base rates). This is frequentist thinking — let the data "speak for itself" without considering prior probabilities.
This frequentist vs Bayesian intuition shows up in machine learning estimation methods.
Translation: "What parameter values make this observed data most likely?"
You flip a coin 10 times: 7 heads, 3 tails.
MLE estimate: P(heads) = 7/10 = 70%
Just counts the data. Doesn't use prior knowledge that coins are usually fair.
Translation: "What parameter values are most likely, given BOTH the data AND my prior beliefs?"
You flip a coin 10 times: 7 heads, 3 tails.
MAP estimate (with prior that coins are usually fair): P(heads) ≈ 55-60%
Balances the observed 70% with prior belief that coins are typically 50/50. Result is pulled toward fairness.
MLE is a special case of MAP where the prior P(θ) is uniform (all values equally likely). When you assume no prior knowledge, MAP reduces to MLE.
MAP with uniform prior = MLE
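Here's a minimal sketch of both estimators for the coin example. The Beta(10, 10) prior encoding "coins are usually fair" is our own assumption (a stronger or weaker prior would pull the estimate more or less), and the uniform Beta(1, 1) case demonstrates the reduction to MLE:

```python
heads, tails = 7, 3  # observed: 7 heads in 10 flips

# MLE: trust only the data.
mle = heads / (heads + tails)  # 0.70

# MAP with a Beta(a, b) prior; the posterior mode is
# (heads + a - 1) / (n + a + b - 2).
def map_estimate(a, b):
    return (heads + a - 1) / (heads + tails + a + b - 2)

print(f"MLE:                     {mle:.3f}")                   # 0.700
print(f"MAP, Beta(10,10) prior:  {map_estimate(10, 10):.3f}")  # 0.571, pulled toward 0.5
print(f"MAP, uniform Beta(1,1):  {map_estimate(1, 1):.3f}")    # 0.700 = MLE!
```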
Your intuition is frequentist — you naturally focus on immediate evidence and ignore base rates. But Bayesian thinking (considering priors) often gives better answers, especially with limited data. This is why the medical test problem feels so counterintuitive: we have to FIGHT our frequentist intuition to properly incorporate the base rate!
You've tested positive for a rare disease. The test is 99% accurate. Should you panic?
What's the probability you actually have the disease?
Prior: your initial belief before seeing any evidence. In this case: P(disease) = 0.1% (the base rate in the population)
Posterior: your updated belief after seeing evidence. This is what we're trying to find: P(disease | positive test)
Likelihood: how likely the evidence is if your hypothesis is true. In this case: P(positive test | disease) = 99%
🔑 Key Insight: Bayes' Theorem tells us how to update our Prior using the Likelihood to get the Posterior
Prior + New Evidence (Likelihood) = Posterior
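In symbols, using this chapter's notation:

P(disease | positive) = P(positive | disease) × P(disease) / P(positive)

where the denominator P(positive) adds up every way to test positive:

P(positive) = P(positive | disease) × P(disease) + P(positive | no disease) × P(no disease)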
[Interactive calculator: set the disease prevalence (how common the disease is in the population) and the test accuracy (how well the test detects the disease). The display shows the prior, your chance of disease before the test, then applies the new evidence (Test Result = POSITIVE) to give the posterior, your chance of disease after the positive test.]
With default values (0.1% prevalence, 99% accuracy), even after testing positive, you only have about 9% chance of actually having the disease!
Why? Because the disease is so rare. Out of 100,000 people, only about 100 have the disease (and ~99 of them test positive), while the 99,900 healthy people generate ~999 false positives. Most positive results come from healthy people!
This is why doctors often order multiple tests—each positive result updates the probability higher!
Experiment 1: Increase disease prevalence to 10%. Notice how the posterior jumps to 91%!
Experiment 2: Lower test accuracy to 90%. See how false positives increase.
Experiment 3: Set prevalence to 50% (coin flip). What happens to the posterior?
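Here's a sketch you can use to rerun these experiments, assuming the test's false-positive rate is also 1% (i.e., 99% accuracy in both directions; slightly different assumptions shift the numbers by a point or so):

```python
def p_disease_given_positive(prevalence, accuracy):
    """Posterior probability of disease after one positive test.
    Assumes the same accuracy for detecting the disease (sensitivity)
    and for clearing healthy people (specificity)."""
    true_pos = prevalence * accuracy
    false_pos = (1 - prevalence) * (1 - accuracy)
    return true_pos / (true_pos + false_pos)

print(p_disease_given_positive(0.001, 0.99))  # ≈ 0.09  (the defaults)
print(p_disease_given_positive(0.10, 0.99))   # ≈ 0.92  (experiment 1)
print(p_disease_given_positive(0.001, 0.90))  # ≈ 0.009 (experiment 2)
print(p_disease_given_positive(0.50, 0.99))   # = 0.99  (experiment 3)
```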
Medical testing is life-and-death. But Bayesian thinking and conditional probability have solved mysteries throughout history. Let's look at one of the most famous examples...
In 1787-1788, three founding fathers—Alexander Hamilton, James Madison, and John Jay—wrote 85 essays to convince Americans to ratify the new U.S. Constitution. They published all essays anonymously under the pen name "Publius".
After publication, the authorship of most essays was clear:
The disputed papers: Numbers 49-58, 62, and 63
For 175 years, historians debated who wrote these 12 essays. Hamilton died in a duel with Aaron Burr in 1804 without clarifying. Madison died in 1836, also leaving the question unresolved.
Statisticians Frederick Mosteller (Harvard) and David Wallace (University of Chicago) used conditional probability and Bayes' Theorem to solve the mystery.
They analyzed essays where authorship was certain and found discriminating words—small filler words that authors use unconsciously:
Hamilton loved "upon" (used it 10× more than Madison). Madison loved "by" (used it 2× more than Hamilton). These aren't conscious choices—they're unconscious writing habits!
For each disputed essay, they calculated: how likely are this essay's word frequencies if Hamilton wrote it, versus how likely if Madison wrote it? The ratio of those two probabilities gives the odds favoring one author.
They analyzed about 30 discriminating words. Each word provides independent evidence that updates the probability.
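Here's a toy sketch of the idea with made-up word rates (illustrative only, not Mosteller and Wallace's actual numbers), modeling word counts in a 1,000-word stretch of text with a Poisson distribution:

```python
import math

# Hypothetical usage rates (occurrences per 1,000 words) -- invented for illustration.
hamilton = {"upon": 3.0, "by": 7.0}   # Hamilton favors "upon"
madison = {"upon": 0.3, "by": 14.0}   # Madison favors "by"

# Hypothetical counts observed in 1,000 words of a disputed essay.
observed = {"upon": 0, "by": 20}

def log_poisson(count, rate):
    """Log-probability of seeing `count` occurrences under Poisson(rate)."""
    return count * math.log(rate) - rate - math.lgamma(count + 1)

log_odds = sum(
    log_poisson(c, madison[w]) - log_poisson(c, hamilton[w])
    for w, c in observed.items()
)
print(f"Odds favoring Madison: {math.exp(log_odds):,.0f} to 1")
# Each additional discriminating word multiplies the odds further.
```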
For some disputed essays, the odds favoring Madison reached 160,000,000,000 to 1. That's 160 BILLION to 1! The other disputed essays showed lower odds, but still overwhelmingly favoring Madison.
All 12 disputed Federalist Papers (49-58, 62-63) were written by James Madison.
After 175 years of historical debate, conditional probability settled the question with overwhelming confidence.
Back in Chapters 1-2, we learned: Price = $100k + $50k × bedrooms + $100k × bathrooms
The model predicts: $450,000
Will this house sell for exactly $450,000.00?
Even with the same bedrooms and bathrooms, different houses sell for different prices because of factors we didn't measure: location, lot size, condition, renovations, school district, market timing, negotiation.
"Houses with 3 bedrooms and 2 bathrooms sell for $450k on average"
But individual houses vary around this average.
Let's say we look at 100 houses with 3 bedrooms and 2 bathrooms. Here's what we might observe:
- Mean = $450k: the model's prediction, the center of the distribution. Most houses cluster around this value.
- Standard deviation ≈ $30k: measures the "typical" spread. About 68% of houses sell within ±$30k of the mean ($420k-$480k).
- 95% range = $390k-$510k: about 95% of houses with these features sell within this range (mean ± 2 standard deviations).
The model predicts the average (mean) price. Individual houses vary around this average due to unmeasured factors. This variation forms a distribution (typically bell-shaped/normal).
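Here's that picture as a sketch in code, assuming prices follow a normal distribution with the mean and spread above:

```python
import random
import statistics

random.seed(7)

# Simulate sale prices for 100 houses with 3 bed / 2 bath,
# assuming Normal(mean=$450k, std=$30k).
prices = [random.gauss(450_000, 30_000) for _ in range(100)]

mean = statistics.mean(prices)
std = statistics.stdev(prices)
print(f"mean ≈ ${mean:,.0f}, std ≈ ${std:,.0f}")
print(f"about 95% of sales fall in roughly "
      f"${mean - 2*std:,.0f} to ${mean + 2*std:,.0f}")
```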
A point prediction is what simple models give you: a single number, like "I predict $450k."
Problem: it doesn't tell you how confident the prediction is!
Modern ML systems provide both prediction AND uncertainty.
✓ Better: "I predict $450k, typically ±$30k"
Advanced models output entire probability distributions.
✓ Best: "Here's the full range of likely prices"
Zillow doesn't just predict "$450k"—they show a range: "$420k - $480k" with a confidence level. This is prediction uncertainty in action!
"High of 75°F, but could range from 72-78°F" — they're giving you the mean and the uncertainty range.
When ChatGPT or Claude say "I'm fairly confident..." or "I'm not entirely sure...", they're communicating uncertainty based on token probability distributions.
If you want to understand the formal terminology:
- Mean (μ): the average value. In our example: $450k.
- Standard deviation (σ): how spread out values are from the mean. In our example: ≈$30k. Smaller σ = tighter predictions; larger σ = more uncertainty.
- Variance (σ²): the square of the standard deviation. Less intuitive but mathematically useful. In our example: ($30k)² = $900M (note the units are dollars squared).
- Normal distribution: the bell-shaped curve, also called the Gaussian distribution. Many natural phenomena (including ML prediction errors) follow this pattern.
In linear regression, we assume the prediction errors (the difference between predicted and actual values) follow a normal distribution with mean 0 and some standard deviation σ.
This lets us say: "95% confident the actual price will be within $390k-$510k"
Coin flips, Monty Hall, conditional probability — understanding uncertainty in discrete events
When outcomes are continuous (like house prices), we use probability distributions to model uncertainty
Modern ML doesn't just predict values — it predicts distributions, giving you both the answer AND the confidence
Whether it's flipping coins, predicting house prices, or getting answers from ChatGPT — probability and uncertainty are fundamental. Good ML systems don't just give you answers; they tell you how confident they are in those answers.
We've seen how probability shapes predictions in machine learning—from coin flips to house prices. Now let's explore the most sophisticated probability machines ever built: Large Language Models. Every word you read from ChatGPT, Claude, or Gemini is the result of probability distributions over tens of thousands of possible words.
Every word you see from ChatGPT, Claude, or Gemini is the result of sampling from a probability distribution. Let's understand what that really means.
Step 1: You give the model a prompt, say, "The capital of France is".
Step 2: The model breaks the text into tokens (words and word pieces).
Step 3: For the next position, it computes a probability for every token in its vocabulary of ~50,000, for example: "Paris" 95%, "located" 2%, "actually" 1%, and tiny slivers for everything else.
Step 4: The model samples a word based on these probabilities. Usually "Paris" (95% likely), but occasionally something else!
LLMs don't "know" facts. They predict probable next tokens.
When Claude says "Paris," it's not retrieving a stored fact—it's predicting the most probable token given billions of training examples where "capital of France is" preceded "Paris."
You can control how "creative" or "conservative" an LLM is by adjusting temperature—a parameter that reshapes the probability distribution.
Temperature ≈ 0 (low): always picks the MOST likely token
✅ Use for: Factual Q&A, code generation, translations
Temperature = 1 (default): samples proportionally to the original probabilities
⚖️ Use for: General chat, balanced responses
Temperature > 1 (high): flattens the distribution—more surprises!
🎨 Use for: Creative writing, brainstorming, poetry
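Under the hood, temperature divides the model's raw scores (logits) before they're turned into probabilities. A minimal sketch with a made-up 3-token vocabulary:

```python
import numpy as np

def apply_temperature(logits, temperature):
    """Convert logits to probabilities, reshaped by temperature."""
    scaled = np.array(logits) / temperature
    exp = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    return exp / exp.sum()

logits = [5.0, 2.0, 1.0]  # hypothetical scores for 3 tokens
for t in [0.1, 1.0, 2.0]:
    print(t, apply_temperature(logits, t).round(3))
# Low temperature sharpens the distribution (near-greedy);
# high temperature flattens it (more diversity).
```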
Prompt: "Write a story opening: 'Once upon a time'"
Besides temperature, LLMs use sampling strategies to decide which tokens are even considered.
Top-K sampling. Strategy: only consider the K most likely tokens.
Example with K = 3: only "Paris" (95%), "located" (2%), and "actually" (1%) stay in the running. All 49,997 other tokens are ignored, even though they sum to 2%.
🎯 Prevents completely nonsensical outputs
Top-P (nucleus) sampling. Strategy: consider the smallest set of tokens whose probabilities sum to at least P.
Example with P = 0.95: include tokens until the cumulative probability reaches 95%. This adapts to context—sometimes 3 tokens are considered, sometimes 50!
🧠 More flexible than Top-K; used by most modern LLMs
These techniques balance quality (staying probable) with diversity (avoiding repetition). Without them, an LLM either repeats the same safest phrases over and over, or occasionally samples an absurd low-probability token that derails the text. Both filters are sketched in code below.
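Here's a minimal sketch of both filters over a toy 5-token vocabulary (the probabilities are invented for illustration):

```python
import numpy as np

def top_k_filter(probs, k):
    """Keep only the k most likely tokens, then renormalize."""
    idx = np.argsort(probs)[::-1][:k]
    filtered = np.zeros_like(probs)
    filtered[idx] = probs[idx]
    return filtered / filtered.sum()

def top_p_filter(probs, p):
    """Keep the smallest set of tokens with cumulative probability >= p."""
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # number of tokens to keep
    filtered = np.zeros_like(probs)
    filtered[order[:cutoff]] = probs[order[:cutoff]]
    return filtered / filtered.sum()

probs = np.array([0.95, 0.02, 0.01, 0.01, 0.01])  # toy vocabulary
print(top_k_filter(probs, 3))    # only the top 3 tokens survive
print(top_p_filter(probs, 0.95)) # just the top token already reaches 95%
```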
So far, we've seen how LLMs sample individual words from probability distributions. But what about complex tasks—like solving math problems or writing code? This brings us to one of the most talked-about topics in AI: "reasoning."
You've probably heard about "reasoning models" like OpenAI's o1 or "chain-of-thought prompting." Let's demystify what "reasoning" actually means in LLMs—spoiler: it's still all about probability.
Despite the terminology, LLMs don't have internal mental states, beliefs, or logic. They:
What they DO: Generate sequences of tokens where each token is predicted based on probability distributions learned from training data.
"Reasoning" in LLMs means generating intermediate steps that increase the probability of reaching correct final answers.
Prompt: "What is 347 × 29?"
"10,063"
Accuracy: ~60%
The model must predict the entire answer in one token, sampling from a distribution where the correct answer might only have 60% probability.
Prompt: "What is 347 × 29? Let's think step by step."
"First, 347 × 20 = 6,940
Then, 347 × 9 = 3,123
Adding: 6,940 + 3,123 = 10,063"
Accuracy: ~90%
Each intermediate step has high probability (easier to predict). The path through probability space leads to the correct answer more reliably.
It's not magic—it's probability geometry.
By generating intermediate tokens, the model breaks one hard prediction into several easy ones: each step conditions on the steps already written, so every next token is sampled from a sharper, higher-confidence distribution.
Think of it like the Monty Hall problem: showing intermediate steps updates the probabilities, just like the host revealing a goat!
Models like OpenAI's o1 take chain-of-thought to the extreme: they generate thousands of internal reasoning tokens before outputting an answer.
"Prove that √2 is irrational"
(You don't see these—they're internal to the model)
"Hmm, this is a proof by contradiction..."
"Assume √2 = p/q in lowest terms..."
"Then 2q² = p², so p² is even..."
"Wait, that means p is even too..."
"So p = 2k for some k..."
"Substituting: 2q² = (2k)² = 4k²..."
"Therefore q² = 2k², so q is also even..."
"But that contradicts p/q being in lowest terms!"
Model generated ~500 reasoning tokens internally
"By contradiction: Assume √2 = p/q. Then 2q² = p², implying both p and q are even, contradicting lowest terms. Therefore √2 is irrational."
More reasoning tokens = exploring more probability paths = higher chance of correct answer
It's like the Frequentist Monty Hall simulator: running more trials gives you better estimates. Here, generating more intermediate tokens lets the model "search" through probability space more thoroughly.
Even with "reasoning," LLMs have fundamental limits because they're statistical pattern matchers, not logical engines.
An LLM might generate the text "Therefore A implies B" even if A doesn't actually imply B—it's predicting plausible text, not performing symbolic logic.
Even o1 can make errors on math problems. It's just much less likely to, because it explores more probability paths. But probability ≠ certainty.
LLMs can only generate reasoning patterns seen in training data. They can't invent genuinely new proof techniques or logical frameworks.
With models like o1, you can't audit the reasoning—it's hidden. You're trusting a probability distribution, not verifiable logic.
Think of LLM "reasoning" through a Bayesian lens:
Chain-of-thought is essentially Bayesian updating in token space—each step refines the distribution until a high-confidence answer emerges.
ML predictions aren't guarantees—they're probabilistic statements about what's likely, given the data.
Frequentist: Long-run frequency (repeat experiments)
Bayesian: Degree of belief (update with evidence)
Both are valid, both are used in ML.
Sigmoid: Squashes outputs to [0,1] for binary classification
Softmax: Creates probability distribution over multiple classes
Loss = -log(p) penalizes confident wrong predictions heavily
This is why we use logarithms in loss functions!
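If those terms are new, here's a minimal sketch of all three:

```python
import math

def sigmoid(x):
    """Squash any real number into (0, 1)."""
    return 1 / (1 + math.exp(-x))

def softmax(scores):
    """Turn a list of scores into a probability distribution."""
    exps = [math.exp(s - max(scores)) for s in scores]  # stable
    total = sum(exps)
    return [e / total for e in exps]

print(sigmoid(2.0))              # ≈ 0.88 -- "88% confident it's spam"
print(softmax([2.0, 1.0, 0.1]))  # three class probabilities summing to 1

# Cross-entropy loss -log(p) punishes confident mistakes:
for p in [0.9, 0.5, 0.1]:
    print(f"predicted p={p} for the true class -> loss {-math.log(p):.2f}")
# p=0.9 -> 0.11 (small penalty), p=0.1 -> 2.30 (large penalty!)
```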
From training on billions of examples to generating text token-by-token, probability is the foundation of how modern AI works.
- Bayes' Theorem: update beliefs with new evidence
- Sigmoid: converts any number to a probability in [0,1]
- Softmax: probability distribution over K classes
- Cross-entropy (log loss): measures surprise / prediction quality
Test what you've learned in this chapter!
Dropout: Randomly "turning off" neurons during training
→ Uses Bernoulli probability (coin flip for each neuron)
Softmax: Creates probability distribution over words
→ "Which words should I focus on?" = weighted by probabilities
Mixture of Experts: Routing tokens to different expert models
→ Router outputs probabilities: "Send 70% of this token to Expert 1"
Document Ranking: Which documents are most relevant?
→ Cosine similarity scores interpreted as probabilities
With this probability foundation, concepts like softmax, dropout, and uncertainty estimation will make intuitive sense when you encounter them.