Chapter 2

Applications & Loss Functions

From real-world predictions to choosing the right error measure

The Power of Learned Weights

Making Predictions on NEW Data

In Chapter 1, we learned the complete pattern for house pricing, including the bias term: Price = $100k + $50k × bedrooms + $100k × bathrooms. Now we can predict prices for NEW houses that weren't in our training data!

Example: New House

Consider a house with 5 bedrooms and 2 bathrooms (not in our original training data)

Price = $100k + $50k × 5 + $100k × 2 = $100k + $250k + $200k = $550k

The model predicts this house costs $550k!

This is the whole point!

We learn patterns from known data (training) to make predictions on unknown data (inference).
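The learned pattern can be written as a tiny function (a minimal sketch; the name `predict_price` is ours, not from any library):

```python
def predict_price(bedrooms, bathrooms):
    """Apply the learned pattern: bias + weighted features."""
    bias = 100_000          # base price (the bias term from Chapter 1)
    w_bedrooms = 50_000     # $50k per bedroom
    w_bathrooms = 100_000   # $100k per bathroom
    return bias + w_bedrooms * bedrooms + w_bathrooms * bathrooms

# Inference: predict for the new 5-bed, 2-bath house
print(predict_price(5, 2))  # 550000, i.e. $550k
```

Training found the weights; inference is just plugging new inputs into this function.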

🎯 Try Predicting House Prices

Use the learned pattern: Price = $100k (bias) + $50k × bedrooms + $100k × bathrooms

Price = $100k + $50k × 5 + $100k × 2 = $550k

Predicted Price: $550k

This is Supervised Learning

We call it "supervised" because we gave the machine labeled training data:

  • Inputs (bedrooms, bathrooms): The features we know about each house
  • Label (price): The correct answer (actual sale price) attached to each input
  • Training Data: Our table with (4 bed, 2 bath)→$400k, (1 bed, 2 bath)→$250k, etc.
  • Learning: Finding weights AND bias that match house features to prices

Real-World Linear Regression

House Price Prediction

You're buying property. Even bare land (0 bedrooms, minimum 500 sqft) costs $100k for the lot, utilities, and foundation. Each bedroom adds $50k. Each square foot adds $100. The model learns these patterns from past sales!

[Interactive sliders: bedrooms from 0 (Land) to 5, square footage from 500 (Min) to 5000]

Predicted Price: $450,000

$100k + ($50k × 3 bedrooms) + ($100 × 2000 sqft) = $450k

Formula: Price = $100k (bias) + $50k × bedrooms + $100 × sqft

💡 Why the $100k Bias?

Just like in Chapter 1, the bias represents the base land value. Even with 0 bedrooms and minimum land (500 sqft), the base land value is $100k. This accounts for:

  • Land acquisition rights
  • Utilities hookup (water, electricity, gas)
  • Foundation and permits

Try it: Set bedrooms to 0 and sqft to 500. You'll see the price is $150k — that's the bias ($100k) plus the minimum sqft cost ($100 × 500 = $50k).

By learning weights from existing house sales, the model discovered: bias=$100k, w1=$50k/bedroom, w2=$100/sqft. Now it can predict prices for houses not yet on the market!
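The sqft model above can be sketched the same way (function name ours); it reproduces both the $450k example and the $150k bare-land case:

```python
def predict_price(bedrooms, sqft):
    """Learned pattern: bias + $50k per bedroom + $100 per sqft."""
    bias = 100_000          # land, utilities, foundation
    return bias + 50_000 * bedrooms + 100 * sqft

print(predict_price(3, 2000))  # 450000 -> the $450k example
print(predict_price(0, 500))   # 150000 -> bias plus minimum-sqft cost
```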

From Simple to Complex

Our Learning Example

2 features → 1 prediction
Price = w1×bedrooms + w2×bathrooms
4 training houses

Modern ML

1000s of features → Many predictions
Millions/Billions of weights
Millions of training examples

The pattern in our example was simple, so only a small amount of labeled data was needed.
Modern ML requires massive datasets, and their availability fuels the AI revolution!

When Numbers Aren't Enough: Measuring Different Kinds of Errors

Building an Email Spam Filter

You're building an email spam filter. For each email, you want to predict: Is this spam or not?

Why MSE Doesn't Work Here

In Chapter 1, we used Mean Squared Error (MSE) to measure how far our predictions were from actual values:

MSE = Σ (predicted - actual)² / n

This worked perfectly for predicting continuous numbers like house prices ($400k, $250k, $500k). But what if we want to predict categories like spam vs not-spam?

📧 Building a Spam Filter

Your model looks at an email and outputs a probability:

Email: "Congratulations! You won $1,000,000! Click here now!"
Model's prediction: 0.95 (95% confident it's spam)
Actual label: 1 (yes, it is spam)

The model is doing great! It's 95% confident this is spam, and it's correct. But how do we measure "how far off" the model's predictions are when they're not perfect?

MSE for Regression

Use when predicting continuous numbers

Loss = Σ (predicted - actual)² / n
Example:
Predicted price: $450k
Actual price: $500k
Error²: ($450k - $500k)² = $2.5B

Story anchor: "How far off were we from the actual number?"
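The MSE example above, written out (a sketch; the helper name `mse` is ours):

```python
def mse(predicted, actual):
    """Mean Squared Error over paired predictions and labels."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

# Single prediction: $450k predicted vs. $500k actual (values in dollars)
print(mse([450_000], [500_000]))  # 2500000000.0 -> the 2.5 billion from the example
```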

Cross-Entropy for Classification

Use when predicting probabilities for categories

Loss = -[actual × log(predicted) + (1-actual) × log(1-predicted)]
Example:
Predicted: 0.95 (95% spam)
Actual: 1 (yes, spam)
Loss: -[1 × log(0.95)] = 0.05

Story anchor: "How surprised should we be by the actual answer?"
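The cross-entropy formula above, as a function (a sketch using the natural log, which matches all the loss values in this chapter; the name `binary_cross_entropy` is ours):

```python
import math

def binary_cross_entropy(predicted, actual):
    """Binary cross-entropy: -[y*log(p) + (1-y)*log(1-p)]."""
    return -(actual * math.log(predicted)
             + (1 - actual) * math.log(1 - predicted))

# Spam example: model says 0.95, and the email really is spam (label 1)
print(round(binary_cross_entropy(0.95, 1), 2))  # 0.05
```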

Understanding Cross-Entropy: The Surprise Measure

Cross-entropy measures surprise. When your model is confident and correct, the loss is low. When your model is confident and WRONG, the loss explodes.

🎲 Interactive Surprise Calculator

Adjust the prediction and see how "surprised" the model would be when it sees the actual answer

[Interactive slider: prediction from 1% (Not Spam) through 50% (Unsure) to 99% (Spam), with surprise level from Low through Medium and High to HUGE!]

At 50% (unsure), the cross-entropy loss is 0.69.

Four Scenarios: How Cross-Entropy Rewards and Punishes

Confident & Correct

Prediction: 0.99 (99% spam)

Actual: 1 (spam)

Loss = -log(0.99) = 0.01

🎯 Very low loss! Model is rewarded for being confident and right.

Confident & Wrong

Prediction: 0.01 (1% spam)

Actual: 1 (spam)

Loss = -log(0.01) = 4.61

💥 Huge loss! Model was confident it's not spam, but it IS spam. Maximum surprise!

🤔
Uncertain & Correct

Prediction: 0.60 (60% spam)

Actual: 1 (spam)

Loss = -log(0.60) = 0.51

⚠️ Medium loss. Model got it right but wasn't confident. Room for improvement.

🤷
Uncertain & Wrong

Prediction: 0.40 (40% spam)

Actual: 1 (spam)

Loss = -log(0.40) = 0.92

⚠️ Model hedged its bets and was wrong. Higher loss than scenario 3.
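All four scenarios share the actual label 1 (spam), so each loss reduces to -log(predicted). A short loop reproduces the numbers above (a sketch, natural log assumed):

```python
import math

scenarios = [
    ("Confident & correct", 0.99),
    ("Confident & wrong",   0.01),
    ("Uncertain & correct", 0.60),
    ("Uncertain & wrong",   0.40),
]

# Actual label is 1 in every case, so loss = -log(predicted probability)
for name, p in scenarios:
    print(f"{name}: loss = {-math.log(p):.2f}")
```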

Key Insight: Why "log"?

The logarithm creates the "surprise" effect:

  • log(0.99) = -0.01 → Small surprise (model said 99%, answer was yes)
  • log(0.50) = -0.69 → Medium surprise (model was unsure, 50-50)
  • log(0.01) = -4.61 → HUGE surprise (model said 1%, answer was yes!)

The negative sign flips it so that low loss = good performance.

Choosing the Right Loss Function

The loss function must match your task type. Using a mismatched loss function is like using a thermometer to measure distance—it simply doesn't work.

What are you predicting?

Continuous Numbers
House price: $450,000
Temperature: 72.5°F
Stock price: $152.30
Use MSE
(Mean Squared Error)
Categories/Probabilities
Spam: Yes/No
Image: Cat/Dog/Bird
Sentiment: Positive/Negative
Use Cross-Entropy
(Log Loss)
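The decision rule above can be sketched as a simple dispatch (a hypothetical helper, not a real library API):

```python
def choose_loss(task_type):
    """Pick the loss function that matches the prediction task."""
    if task_type == "regression":        # continuous numbers
        return "MSE"
    if task_type == "classification":    # categories / probabilities
        return "cross-entropy"
    raise ValueError(f"unknown task type: {task_type}")

print(choose_loss("regression"))      # MSE
print(choose_loss("classification"))  # cross-entropy
```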

Coming Up in Chapter 3

Now that you understand both MSE and cross-entropy, you're ready to learn classification—where we use cross-entropy to train models that predict categories instead of numbers.

Key Takeaways

🎯 From Training to Prediction

Core concepts from this chapter:

1

Training vs. Inference

Training is learning patterns from known data. Inference is using those learned patterns to make predictions on new, unseen data. The whole point of ML is making accurate predictions on data the model has never seen.

2

Supervised Learning

Supervised learning means training with labeled data: each input has a correct answer attached. The model learns by comparing its predictions to these labels and adjusting weights to minimize the gap.

3

Different Tasks, Different Measures

MSE (Mean Squared Error) works for predicting numbers like prices or temperatures. Cross-Entropy works for predicting categories or probabilities like spam/not-spam. Match the loss function to the task type.

4

Cross-Entropy Measures Surprise

Cross-entropy rewards confidence when correct (low loss) and heavily punishes confidence when wrong (high loss). It's about measuring how "surprised" the model should be by the actual answer.

Key Decision:
Predicting numbers? Use MSE. Predicting categories? Use Cross-Entropy.

The right loss function guides the model toward better predictions for the specific type of task.

Test Your Understanding

Test what you've learned in this chapter!

1. If you train a model with bias=$100k and weights w1=$50k (per bedroom) and w2=$100 (per sqft), what would it predict for a house with 4 bedrooms and 3000 sqft?

2. When should you use Cross-Entropy loss instead of MSE (Mean Squared Error)?

3. Your spam filter predicts 99% confident an email is spam, and it actually IS spam. What will the cross-entropy loss be?
