From real-world predictions to choosing the right error measure
In Chapter 1, we learned the complete pattern for house pricing, including the bias term: Price = $100k + $50k × bedrooms + $100k × bathrooms. Now we can predict prices for NEW houses that weren't in our training data!
Consider a house with 5 bedrooms and 2 bathrooms (not in our original training data)
The model predicts this house costs $550k!
We learn patterns from known data (training) to make predictions on unknown data (inference).
Use the learned pattern: Price = $100k (bias) + $50k × bedrooms + $100k × bathrooms
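As a minimal sketch, here's that inference step in Python, using the weights from the chapter (the function name predict_price is just for illustration):

    # Weights learned in Chapter 1.
    BIAS = 100_000        # base value in dollars
    W_BEDROOM = 50_000    # dollars per bedroom
    W_BATHROOM = 100_000  # dollars per bathroom

    def predict_price(bedrooms, bathrooms):
        """Apply the learned pattern to a house the model has never seen."""
        return BIAS + W_BEDROOM * bedrooms + W_BATHROOM * bathrooms

    print(predict_price(5, 2))  # -> 550000, the $550k prediction above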
We call it "supervised" because we gave the machine labeled training data:
Now consider a variant of the model that uses square footage instead of bathrooms. You're buying property. Even bare land (0 bedrooms, minimum 500 sqft) costs $100k for the lot, utilities, and foundation. Each bedroom adds $50k. Each square foot adds $100. The model learns these patterns from past sales!
Just like in Chapter 1, the bias represents the base land value. Even with 0 bedrooms and minimum land (500 sqft), the base land value is $100k. This accounts for the lot, utilities, and foundation: costs that exist before a single bedroom is added.
Try it: Set bedrooms to 0 and sqft to 500. You'll see the price is $150k — that's the bias ($100k) plus the minimum sqft cost ($100 × 500 = $50k).
By learning weights from existing house sales, the model discovered: bias=$100k, w1=$50k/bedroom, w2=$100/sqft. Now it can predict prices for houses not yet on the market!
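To make that learning step concrete, here's a minimal sketch using NumPy's least squares on a handful of synthetic past sales. The sale records are made up; the prices are generated from the chapter's pattern, so the fit recovers it exactly:

    import numpy as np

    # Hypothetical past sales: [bedrooms, sqft] per house (made-up data).
    X = np.array([[2, 1000], [3, 1500], [4, 2000], [1, 800], [5, 2500]], dtype=float)
    # Sale prices generated from the chapter's pattern: $100k + $50k/bed + $100/sqft.
    prices = 100_000 + 50_000 * X[:, 0] + 100 * X[:, 1]

    # Prepend a column of ones so least squares can learn the bias as a weight.
    X_b = np.hstack([np.ones((len(X), 1)), X])
    bias, w1, w2 = np.linalg.lstsq(X_b, prices, rcond=None)[0]

    print(f"bias=${bias:,.0f}, w1=${w1:,.0f}/bedroom, w2=${w2:,.0f}/sqft")
    # -> bias=$100,000, w1=$50,000/bedroom, w2=$100/sqft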
Our example pattern, Price = w1 × bedrooms + w2 × bathrooms, has only a couple of weights. Modern models have millions or billions of weights.
The pattern was simple in our example - we needed only a small amount of labeled data.
Modern ML requires massive datasets - and their availability fuels the AI revolution!
You're building an email spam filter. For each email, you want to predict: Is this spam or not?
In Chapter 1, we used Mean Squared Error (MSE) to measure how far our predictions were from actual values: MSE = (1/n) × Σ(actual − predicted)².
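As a quick sketch, here's that computation by hand in Python on a few example house prices (the predicted values are made up for illustration):

    # Example actual sale prices, plus made-up model predictions.
    actual    = [400_000, 250_000, 500_000]
    predicted = [380_000, 260_000, 540_000]

    # Mean of the squared differences between actual and predicted values.
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    print(f"MSE: {mse:,.0f}")  # -> MSE: 700,000,000 (units are dollars squared)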
This worked perfectly for predicting continuous numbers like house prices ($400k, $250k, $500k). But what if we want to predict categories like spam vs not-spam?
Your model looks at an email and outputs a probability: say, 0.95, meaning it's 95% confident the email is spam.
The model is doing great! It's 95% confident this is spam, and it's correct. But how do we measure "how far off" the model's predictions are when they're not perfect?
MSE: use when predicting continuous numbers. Story anchor: "How far off were we from the actual number?"
Cross-entropy: use when predicting probabilities for categories. Story anchor: "How surprised should we be by the actual answer?"
Cross-entropy measures surprise. When your model is confident and correct, the loss is low. When your model is confident and WRONG, the loss explodes.
Adjust the prediction and see how "surprised" the model would be when it sees the actual answer:
Scenario 1: prediction 0.99 (99% spam), actual 1 (spam). 🎯 Very low loss! The model is rewarded for being confident and right.
Scenario 2: prediction 0.01 (1% spam), actual 1 (spam). 💥 Huge loss! The model was confident it's not spam, but it IS spam. Maximum surprise!
Scenario 3: prediction 0.60 (60% spam), actual 1 (spam). ⚠️ Medium loss. The model got it right but wasn't confident. Room for improvement.
Scenario 4: prediction 0.40 (40% spam), actual 1 (spam). ⚠️ The model hedged its bets and was wrong. Higher loss than scenario 3.
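Here's a minimal sketch that reproduces the loss for all four scenarios, using the binary cross-entropy formula (for an actual label of 1, loss = −log(prediction)):

    import math

    # Binary cross-entropy when the actual answer is 1 (spam): loss = -log(p).
    for i, p in enumerate([0.99, 0.01, 0.60, 0.40], start=1):
        loss = -math.log(p)
        print(f"Scenario {i}: prediction={p:.2f}, loss={loss:.2f}")

    # Scenario 1: loss=0.01   (confident and right)
    # Scenario 2: loss=4.61   (confident and wrong: the loss explodes)
    # Scenario 3: loss=0.51   (right but unsure)
    # Scenario 4: loss=0.92   (hedged and wrong)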
The logarithm creates the "surprise" effect: when the actual answer is 1, the loss is −log(prediction), so a prediction near 1 gives a loss near 0, while a prediction near 0 sends the loss toward infinity. And since the log of a probability between 0 and 1 is negative, the negative sign flips it so that low loss = good performance.
The loss function must match your task type. Using a mismatched loss function is like using a thermometer to measure distance—it simply doesn't work.
Now that you understand both MSE and cross-entropy, you're ready to learn classification—where we use cross-entropy to train models that predict categories instead of numbers.
Core concepts from this chapter:
Training is learning patterns from known data. Inference is using those learned patterns to make predictions on new, unseen data. The whole point of ML is making accurate predictions on data the model has never seen.
Supervised learning means training with labeled data - each input has a correct answer attached. The model learns by comparing its predictions to these labels and adjusting weights to minimize the gap.
MSE (Mean Squared Error) works for predicting numbers like prices or temperatures. Cross-Entropy works for predicting categories or probabilities like spam/not-spam. Match the loss function to the task type.
Cross-entropy rewards confidence when correct (low loss) and heavily punishes confidence when wrong (high loss). It's about measuring how "surprised" the model should be by the actual answer.
Key Decision:
Predicting numbers? Use MSE. Predicting categories? Use Cross-Entropy.
The right loss function guides the model toward better predictions for the specific type of task.
Test what you've learned in this chapter!