Chapter 1

Pattern Discovery & Learning

Find the hidden pattern and learn how machines discover weights

Discovering Patterns in Data

💭 Can We See What the Machine Sees?

Imagine helping a friend price their house for sale. They ask: "How much should I charge?"

Looking at recent sales in the neighborhood, here's the data:

House Bedrooms Bathrooms Sale Price
A 4 2 $400k
B 1 2 $250k
C 4 3 $500k
D 2 1 $200k

⚡ The Friend's House:

3 bedrooms, 2 bathrooms

What price should they ask? $______k

🔍 The Hidden Formula

There IS a mathematical pattern hidden in this data. A formula that perfectly predicts each price from bedrooms and bathrooms.

Let's try to discover it.

🎯 Try to Discover the Formula

The pattern is: Price = ? × Bedrooms + ? × Bathrooms

Price = w₁ × Bedrooms + w₂ × Bathrooms
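If you'd rather let a computer do the hunting, here is a minimal Python sketch: it takes the four sale records from the table above and tries round-number candidate weights (the candidate grid is just an assumption to keep the search small), keeping any pair that reproduces every price exactly.

# Try round-number candidate weights and keep the pair that reproduces
# every sale price in the table exactly.
houses = [(4, 2, 400), (1, 2, 250), (4, 3, 500), (2, 1, 200)]  # (bedrooms, bathrooms, price in $k)

for w1 in range(0, 201, 25):          # candidate $k per bedroom
    for w2 in range(0, 201, 25):      # candidate $k per bathroom
        if all(w1 * bed + w2 * bath == price for bed, bath, price in houses):
            print(w1, w2)             # prints: 50 100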

💡 We Found the Pattern!

Think of the price like a total score. Each bedroom adds 50k points, each bathroom adds 100k points. These numbers are called weights - they tell us how much value each feature contributes.

Price = 50k × Bedrooms + 100k × Bathrooms
w₁ = 50k (each bedroom adds 50k)
w₂ = 100k (each bathroom adds 100k)

This simple pattern - multiply each feature by its weight, then add them all up - is the foundation of machine learning. It's called linear because we're just multiplying and adding (no squares, no curves, just straight calculations).
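As a quick sanity check, here is that pattern as a few lines of Python (the function name predict_price is just an illustration):

def predict_price(bedrooms, bathrooms, w1=50, w2=100):
    # multiply each feature by its weight, then add them up (price in $k)
    return w1 * bedrooms + w2 * bathrooms

print(predict_price(4, 2))  # 400, matches House A
print(predict_price(3, 2))  # 350, one possible asking price for the friend's 3 bed / 2 bath house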

Discovering the Baseline

Testing Our Pattern

Great! We've learned the pattern: Price = $50k × Bedrooms + $100k × Bathrooms

Let's test it on some new houses to see how well it works:

✅ Works Great!

House E: 3 bedrooms, 3 bathrooms → $50k × 3 + $100k × 3 = $450k
House F: 5 bedrooms, 4 bathrooms → $50k × 5 + $100k × 4 = $650k

Perfect! Our pattern predicts these prices accurately.

Now, imagine someone is selling just the land - a bare lot with permits and utilities already installed.

No house built yet, so: 0 bedrooms, 0 bathrooms

What should our model predict?

🤔 Let's check this...

Price = $50k × 0 + $100k × 0 = $0

Wait... that can't be right!

The land itself is worth money. The location, permits, utility hookups, foundation prep - these cost real money even before any house is built.

A bare lot in this neighborhood probably costs at least $100k, not $0!

🔍 Why Does This Happen?

Our current formula has a hidden constraint: when all inputs are zero, the output MUST be zero.

Look at the math: Price = 50k × 0 + 100k × 0 = 0 + 0 = 0

No matter what weights we learn, if the inputs are all zero, we're just multiplying by zero. The formula is locked to pass through the origin (0,0).

The Solution: Adding a Baseline

What if we added a baseline value to our formula - a starting price that exists even when bedrooms and bathrooms are zero?

New Formula with Bias

Price = $100k + $50k × Bedrooms + $100k × Bathrooms

The $100k is our bias - the base land value.

Now let's test it on our bare land...

✅ Now It Works!

Bare Land (0 bedrooms, 0 bathrooms):
Price = $100k + $50k × 0 + $100k × 0
= $100k

Perfect! The land now has a realistic base value of $100k, even with zero features.
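In code, adding the bias is a one-term change. A small sketch (same illustrative predict_price function as before, now with a baseline argument):

def predict_price(bedrooms, bathrooms, w1=50, w2=100, bias=100):
    # start from the baseline land value, then add the weighted features (price in $k)
    return bias + w1 * bedrooms + w2 * bathrooms

print(predict_price(0, 0))  # 100, bare land keeps its base value
print(predict_price(4, 2))  # 500, House A with the land value included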

💡 Understanding What Just Happened

Notice something interesting: our predictions just changed! Let's compare:

House    Original formula (no bias)    With bias (complete model)
A        $400k                         $500k (+$100k)
B        $250k                         $350k (+$100k)
D        $200k                         $300k (+$100k)

Why did they increase? Our original formula was simplified - it worked for comparing houses to each other, but it was missing the foundation: the land itself.

Think of it this way:

  • Our first formula told us the difference between houses (how much more/less based on rooms)
  • The bias tells us the starting point (what the land + location is worth)
  • Together, they give us the complete picture of real property values

Key Insight: The bias reveals value that was hidden before. In reality, this neighborhood's houses are worth more because the land itself has significant value ($100k baseline).

Understanding Bias Mathematically

Without Bias

Price = w1 × Bedrooms + w2 × Bathrooms

When inputs = 0, output MUST = 0

Bare land has value too!

With Bias

Price = baseline + w1 × Bedrooms + w2 × Bathrooms

Baseline value exists even when inputs = 0

Solution: Land has base value!

Visual Comparison

Without Bias: y = w×x
The line must pass through the origin (0,0).

⚠️ CONSTRAINED: Line locked to origin (0,0)

With Bias: y = w×x + b
The bias b shifts the line up or down, so it can cross the y-axis anywhere.

✓ FLEXIBLE: Line can shift to fit data better

Key Insight

Without bias, the line is forced to pass through (0,0), meaning when all inputs are zero, the output must be zero. With bias, the line can start at any value on the y-axis, giving it the flexibility to fit real-world data where zero inputs don't necessarily mean zero output.

What We Learned About Bias

The bare land example showed us why bias matters. Bias represents the baseline value - what the model predicts when all input features are zero. In our case, that's the land value that exists even before adding bedrooms or bathrooms.

Without bias, our formula was locked to predict $0 for empty lots. With bias, we can capture the reality that location, permits, and utilities have value regardless of the house built on top.

Summary: The Power of Bias

Before (No Bias): Price = $50k × bedrooms + $100k × bathrooms

→ Bare land = $0 ✗

After (With Bias): Price = $100k + $50k × bedrooms + $100k × bathrooms

→ Bare land = $100k ✓

Bias unlocked the model's ability to represent real-world baselines!

🎯 Interactive Bias Demo

Price = 50k×bedrooms + 100k×bathrooms + 200k
See how bias changes predictions:
For each bedroom/bathroom combination you enter, the table compares the prediction without bias and with the $200k bias.
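Here is a sketch of what the demo computes, using a few illustrative bedroom/bathroom combinations of our own choosing (the $200k bias is the demo's setting):

def price(bedrooms, bathrooms, bias=0, w1=50, w2=100):
    return bias + w1 * bedrooms + w2 * bathrooms  # price in $k

for bed, bath in [(0, 0), (2, 1), (4, 2)]:
    print(bed, bath, price(bed, bath), price(bed, bath, bias=200))

# output: bedrooms, bathrooms, price without bias, price with bias
# 0 0 0 200
# 2 1 200 400
# 4 2 400 600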

Finding the Right Weights

When the Pattern Changes

Excellent! We now know the complete formula including bias:

Price = $100k (bias) + $50k × Bedrooms + $100k × Bathrooms

But what if you move to a different city? Or a luxury market where bathrooms are worth way more? Or a college town where bedrooms drive the price? Or where land is cheaper?

All three parameters (bias, w1, w2) were specific to our dataset. In a new market, they could be completely different: bias could be $50k or $200k, weights could be (80k, 60k) or (30k, 150k).

How do you find the right parameters when you don't know them in advance?

Try It Yourself: The Manual Hunt

Below is a new dataset from a different neighborhood. The pattern is different, but we don't know the weights yet.

🎯 Try This: Adjust the sliders to find all 3 parameters (bias, w1, w2) that minimize the Total Error.

Try to get it as close to 0 as possible. Now with 3 parameters instead of 2, it's even harder! This is what machine learning does automatically.

Starting point: Price = 0k + 25k × Bedrooms + 150k × Bathrooms
Slider ranges: 0k to 200k for the bias, 0k to 200k for the bedroom weight, 0k to 400k for the bathroom weight
The table lists each house's bedrooms, bathrooms, actual price, predicted price, and error. With the starting values above, the Total Error is $650k. Goal: get it to $0k!
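If you want to replay the hunt outside the widget, here is a sketch of the error it reports. The new neighborhood's actual prices live in the widget, so supply your own (bedrooms, bathrooms, actual_price) rows; the variable my_houses below is hypothetical.

def total_error(bias, w1, w2, houses):
    # sum of absolute differences between predicted and actual prices (in $k)
    return sum(abs(bias + w1 * bed + w2 * bath - price)
               for bed, bath, price in houses)

# example call with the widget's starting sliders (bias=0, w1=25, w2=150):
# total_error(0, 25, 150, my_houses)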

What You Just Did is Called 'Regression'

A Pattern-Finding Framework

Congratulations! You just performed linear regression — one of the most fundamental techniques in machine learning. But what exactly is regression?

Regression Builds Prediction Models

Regression is about building a mathematical model (equation) to predict the value of a dependent variable (y) based on one or more independent variables (x1, x2, ...).

Independent Variables

x1, x2, x3, ...

The inputs we control or observe (features, predictors)

Examples: bedrooms, square feet, age

Dependent Variable

y

The output we want to predict (target, response)

Example: house price

Linear Regression

Linear regression specifically predicts continuous numbers (like 8, 5, 10, 4.7). The machine finds weights that create a linear (straight-line) relationship between inputs and outputs.

y = w1×x1 + w2×x2 + ... + bias

It's called "linear" because it's a straight-line relationship — no curves, no squares, just multiplication and addition. The weights determine how much each independent variable influences the dependent variable.

Real-World Example

Predicting house prices (y) from square footage (x1) and bedrooms (x2)

The Learning Process

Finding weights automatically by learning from thousands of examples
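In practice you rarely fit this by hand. As a sketch, here is how a standard library (scikit-learn, assuming it is installed; any linear-regression library works the same way) finds the weights from our four example houses automatically. It should recover roughly w1 = 50, w2 = 100, bias = 0:

from sklearn.linear_model import LinearRegression

X = [[4, 2], [1, 2], [4, 3], [2, 1]]   # independent variables: [bedrooms, bathrooms]
y = [400, 250, 500, 200]               # dependent variable: price in $k

model = LinearRegression().fit(X, y)
print(model.coef_)       # approximately [ 50. 100.]
print(model.intercept_)  # approximately 0.0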

How Machines Learn: Gradient Descent

Working at Scale

Manually adjusting weights worked for our simple example with 4 houses and 3 parameters. But what if we had 10,000 houses and 50 features (square footage, age, location coordinates, school ratings, crime statistics, etc.)?

That's 50 weight sliders to tune simultaneously - impractical to do manually.

Enter Gradient Descent — the fundamental optimization algorithm that powers most of machine learning. It automates the process of finding optimal weights.

📝 Note on Simplicity

For this example, we'll use scaled-down prices (e.g., 400 instead of $400,000) to keep the math simple and focus on understanding how gradient descent works.

Step 1: Start with Random Weights

The machine begins with random values for w1, w2, and bias. These initial guesses are usually way off!

w1 = 25 (per bedroom), w2 = 75 (per bathroom), bias = 0

Let's see how badly these weights perform on our house data:

Quick Check (Sum of Absolute Errors):
• House A (4 bed, 2 bath): Predicted = 25×4 + 75×2 = 250, Actual = 400 → |250 - 400| = 150
• House B (1 bed, 2 bath): Predicted = 25×1 + 75×2 = 175, Actual = 250 → |175 - 250| = 75
• House C (4 bed, 3 bath): Predicted = 25×4 + 75×3 = 325, Actual = 500 → |325 - 500| = 175
• House D (2 bed, 1 bath): Predicted = 25×2 + 75×1 = 125, Actual = 200 → |125 - 200| = 75
Total Error: 150 + 75 + 175 + 75 = 475

These random weights are our starting point — now watch how gradient descent refines them!
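The same quick check as a small Python sketch:

houses = [(4, 2, 400), (1, 2, 250), (4, 3, 500), (2, 1, 200)]  # (bedrooms, bathrooms, price)
w1, w2, bias = 25, 75, 0  # the random starting guesses

total_error = sum(abs(w1 * bed + w2 * bath + bias - price)
                  for bed, bath, price in houses)
print(total_error)  # 475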

Step 2: Calculate the Distance

Measure how far the predictions are from actual values using a loss function. The most common is Mean Squared Error (MSE):

Loss = Σ (predicted - actual)² / n

Where n is the number of data points (in our case, 4 training examples). The Σ (sigma) means we sum up the squared errors for all data points.

Real Calculation with Initial Weights (w1=25, w2=75, bias=0):
House A: 4 bed, 2 bath → Predicted: 25×4 + 75×2 = 250, Actual: 400 → Error²: (250-400)² = 22,500
House B: 1 bed, 2 bath → Predicted: 25×1 + 75×2 = 175, Actual: 250 → Error²: (175-250)² = 5,625
House C: 4 bed, 3 bath → Predicted: 25×4 + 75×3 = 325, Actual: 500 → Error²: (325-500)² = 30,625
House D: 2 bed, 1 bath → Predicted: 25×2 + 75×1 = 125, Actual: 200 → Error²: (125-200)² = 5,625
Total Loss: (22,500 + 5,625 + 30,625 + 5,625) / 4 = 16,093.75

This tells us: "The current weights are producing a large loss — we need to reduce this!"
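The same loss calculation as a short Python sketch:

houses = [(4, 2, 400), (1, 2, 250), (4, 3, 500), (2, 1, 200)]
w1, w2, bias = 25, 75, 0

loss = sum((w1 * bed + w2 * bath + bias - price) ** 2
           for bed, bath, price in houses) / len(houses)
print(loss)  # 16093.75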

Step 3: Compute the Gradient

Calculate which direction to adjust each weight to reduce error.

What is the Gradient?

Remember, we have weights (w1 and w2) that determine our predictions. Currently, they're producing errors. The gradient tells us:

Direction: Should we increase or decrease each weight?
Magnitude: By how much should we adjust each weight?
💡 The Intuition

Think of the gradient as a compass that points toward "reducing error." For each weight, it calculates:

  • If we slightly increase this weight, does the error go up or down?
  • If error goes UP when the weight increases → the gradient is positive → decrease the weight
  • If error goes DOWN when the weight increases → the gradient is negative → increase the weight
  • The size of the gradient tells us how steep the change is
📊 Example with Our Weights (w1=25, w2=75)
w1 (per bedroom) = 25 → Gradient: -762.5

✓ Negative gradient means: "If we increase w1, error decreases"
Solution: Increase w1 (move it toward 50)

w2 (per bathroom) = 75 → Gradient: -525

✓ Negative gradient means: w2 also needs to increase
Solution: Increase w2 (move it toward 100)

Key Insight: The gradient tells us both weights need adjustment to reduce prediction errors. This is the "smart" part of gradient descent!
🔄 How Gradient Descent Uses the Gradient
1. Make predictions with the current weights (w1=25, w2=75)
   Result: Total loss = 16,093.75
2. Compute the gradient by looking at ALL predictions
   Result: ∂Loss/∂w1 = -762.5, ∂Loss/∂w2 = -525
3. Update the weights in the direction that reduces error
   w1_new = 25 - (0.01 × -762.5) = 25 + 7.625 = 32.625
   w2_new = 75 - (0.01 × -525) = 75 + 5.25 = 80.25
4. Repeat with the new weights until the error is minimized
   After many iterations: w1 → 50, w2 → 100, error → 0
✨ Key Takeaways
🎯 The gradient is a set of numbers (one per weight) that indicates which direction to adjust each weight
📐 It's computed using calculus (derivatives), but the idea is simple: "test" which direction reduces error
🔄 Gradient Descent is the process of repeatedly computing the gradient and adjusting weights, step by step, until reaching the optimal values
📐 Want to see the actual math? (Optional - Click to expand)
How Do We Find the Gradient?

We know our current loss is 16,093.75. Now we need to figure out: Should we increase or decrease each weight to reduce this loss?

Option 1: The Intuitive Way (Numerical Approximation)
Imagine nudging w1 slightly: 25 → 25.01 (tiny increase by 0.01)
If loss goes UP: gradient is positive → we should move w1 DOWN
If loss goes DOWN: gradient is negative → we should move w1 UP
Option 2: The Math Way (Calculus)
Use derivatives to calculate the gradient directly: ∂Loss/∂w1 and ∂Loss/∂w2
This gives exact slope values without testing!
Let's Calculate the Actual Gradients:
Remember our loss function: Loss = Σ(predicted - actual)² / n

When we take the derivative (using calculus), we get:
∂Loss/∂w = (2/n) × Σ(predicted - actual) × input
Where does (2/n) come from?
• The 2 comes from the derivative of the squared term: d/dx(x²) = 2x
• The n (which is 4) is the number of data points — we're averaging
• So (2/n) = (2/4) = 0.5 — this is just the scaling factor

The key part: Σ(predicted - actual) × input
This sums up how much each data point contributes to the gradient
Calculate ∂Loss/∂w1: How much does changing w1 ($/bedroom) affect the loss?
House A: (predicted - actual) × bedrooms = (250 - 400) × 4 = -150 × 4 = -600
House B: (predicted - actual) × bedrooms = (175 - 250) × 1 = -75 × 1 = -75
House C: (predicted - actual) × bedrooms = (325 - 500) × 4 = -175 × 4 = -700
House D: (predicted - actual) × bedrooms = (125 - 200) × 2 = -75 × 2 = -150
Sum = -600 + (-75) + (-700) + (-150) = -1,525
∂Loss/∂w1 = (2/4) × -1,525 = -762.5
Calculate ∂Loss/∂w2: How much does changing w2 ($/bathroom) affect the loss?
House A: (predicted - actual) × bathrooms = (250 - 400) × 2 = -150 × 2 = -300
House B: (predicted - actual) × bathrooms = (175 - 250) × 2 = -75 × 2 = -150
House C: (predicted - actual) × bathrooms = (325 - 500) × 3 = -175 × 3 = -525
House D: (predicted - actual) × bathrooms = (125 - 200) × 1 = -75 × 1 = -75
Sum = -300 + (-150) + (-525) + (-75) = -1,050
∂Loss/∂w2 = (2/4) × -1,050 = -525
Final Gradient:
∂Loss/∂w1 = -762.5 (negative → increase w1)
∂Loss/∂w2 = -525 (negative → increase w2)

Key Insight: The gradient shows the DIRECTION and STEEPNESS of the slope.
• Positive gradient → decrease the weight
• Negative gradient → increase the weight
• Larger magnitude → steeper slope → bigger adjustments needed

Both gradients are negative, so both weights need to increase to reduce prediction errors! (w1 needs to move from 25 toward 50, and w2 needs to move from 75 toward 100)
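Here is the same gradient calculation as a short Python sketch, following the formula above:

houses = [(4, 2, 400), (1, 2, 250), (4, 3, 500), (2, 1, 200)]
w1, w2, bias = 25, 75, 0
n = len(houses)

# (2/n) × Σ (predicted - actual) × input, once per weight
grad_w1 = (2 / n) * sum((w1 * bed + w2 * bath + bias - price) * bed
                        for bed, bath, price in houses)
grad_w2 = (2 / n) * sum((w1 * bed + w2 * bath + bias - price) * bath
                        for bed, bath, price in houses)
print(grad_w1, grad_w2)  # -762.5 -525.0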

Step 4: Update the Weights

Adjust each weight in the direction that reduces error:

new_weight = old_weight - (learning_rate × gradient)
Understanding the Formula

Why always subtract? The gradient points uphill (toward higher error), so we subtract to go downhill (toward lower error).

🔍 The Subtle Math Trick

Notice we ALWAYS subtract, never add. Here's why this works for both cases:

Case 1: Negative Gradient (Our Scenario)
Example: w1 = 25, gradient = -762.5, learning_rate = 0.01
new_w1 = 25 - (0.01 × -762.5) = 25 - (-7.625) = 25 + 7.625 = 32.625
✓ Subtracting negative → weight increases (25 → 32.625)
Case 2: Positive Gradient (Opposite Scenario)
Example: If w1 = 75, gradient = +500, learning_rate = 0.01
new_w1 = 75 - (0.01 × 500) = 75 - 5.0 = 70
✓ Subtracting positive → weight decreases (75 → 70)
Key Insight: Subtracting a negative number is the same as adding!
So the single formula old - (lr × gradient) handles both increasing AND decreasing weights automatically, depending on the gradient's sign.

Learning Rate: Controls how big of a step we take. Common values: 0.01, 0.001, 0.0001

  • Too large (e.g., 1.0) → We overshoot and bounce around, never settling
  • Too small (e.g., 0.00001) → Learning is extremely slow
  • Just right (e.g., 0.01) → Steady progress to optimal weights
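Putting the update rule and the learning rate together, a small sketch of one step with learning_rate = 0.01 and the gradients computed earlier:

learning_rate = 0.01
w1, w2 = 25, 75
grad_w1, grad_w2 = -762.5, -525.0   # from the gradient step above

w1 = w1 - learning_rate * grad_w1   # 25 - (-7.625) = 32.625
w2 = w2 - learning_rate * grad_w2   # 75 - (-5.25)  = 80.25
print(w1, w2)  # 32.625 80.25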
Step 5: Repeat Until Convergence

Keep repeating steps 2-4 hundreds or thousands of times until:

  • The error stops decreasing significantly
  • We reach a maximum number of iterations
  • The weights settle into optimal values

Final: w1 = 50, w2 = 100, bias = 0 → Total Error ≈ 0! (Perfect predictions: 400, 250, 500, 200)

Key Insight

Gradient Descent doesn't need to try every possible combination of weights (which would take forever!). Instead, it follows the steepest path downhill on the error surface, efficiently finding the weights that minimize prediction error.

Important Note: Multiple Paths, Same Destination

Gradient descent doesn't find one "correct" answer. Like finding paths down a mountain, there can be multiple routes that all reach good valleys. Training twice with different random starting weights will produce different final weights—both can work equally well. This is normal and expected.
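Here is the whole loop as a compact Python sketch, assuming the same four houses and the MSE gradients derived above, with learning rate 0.01. With enough iterations the parameters should settle near w1 ≈ 50, w2 ≈ 100, bias ≈ 0:

houses = [(4, 2, 400), (1, 2, 250), (4, 3, 500), (2, 1, 200)]
w1, w2, bias = 25.0, 75.0, 0.0   # starting guesses
lr, n = 0.01, len(houses)

for step in range(10_000):
    # step 2: prediction error for every house
    errs = [(w1 * bed + w2 * bath + bias - price, bed, bath)
            for bed, bath, price in houses]
    # step 3: gradients of the MSE loss
    grad_w1 = (2 / n) * sum(e * bed for e, bed, _ in errs)
    grad_w2 = (2 / n) * sum(e * bath for e, _, bath in errs)
    grad_b  = (2 / n) * sum(e for e, _, _ in errs)
    # step 4: move each parameter downhill
    w1   -= lr * grad_w1
    w2   -= lr * grad_w2
    bias -= lr * grad_b

print(round(w1, 1), round(w2, 1), round(bias, 1))  # approximately 50.0 100.0 0.0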

🎬 Watch Gradient Descent in Action

Learning-rate presets: Slow 0.001, Good 0.01, Fast 0.05, High 0.15. The panel shows the current iteration, w1, w2, and the loss as training runs.

📊 What the Model is Predicting

Goal: Make predictions match actual values (Error = 0 for all rows = Success!)

Bedrooms Bathrooms Actual (scaled)
4 2 8
1 2 5
4 3 10
2 1 4

The Predicted and Error columns update on every iteration as the weights improve.

Key Takeaways

🎉 The Foundation of Machine Learning

Six core concepts that power modern AI:

1. Machines Learn from Patterns

Machines learn by finding patterns in data, just like discovering the house pricing pattern from examples. No magic - just math looking for relationships.

2. Weights = Impact

Think of weights like points in a score. Each feature (bedrooms, bathrooms) has a weight that shows how much it contributes to the final answer. Bigger weight = bigger impact.

3. Bias = Starting Point

Bias isn't bad - it's your baseline! It's the value you start with before adding up the weighted features. Like the land value before building the house.

4. Linear = Simple Math

"Linear" just means multiply and add - no complicated curves or squares. This simple formula (bias + weights × features) powers millions of real-world predictions.

5. Gradient Descent = Automatic Tuning

Finding weights manually is impractical at scale. Gradient descent finds optimal weights automatically by following the "downhill" path to minimum error.

6. The Foundation of Real ML

These are the building blocks that real systems start with. Netflix recommendations, housing prices, sales forecasts - they all build on these core principles, then add layers of complexity for their specific needs.

The Core Formula:
Prediction = Bias + Weight₁ × Feature₁ + Weight₂ × Feature₂ + ...

This simple pattern is the foundation of linear regression - one of the most powerful and widely-used techniques in AI.

Test Your Understanding

✓ Check Your Understanding

1. You have weight w1 = 2.5 (its optimal value is 1.0) and the gradient is +1.5. What should you do?

2. What happens if your learning rate is too high (e.g., 1.0)?

3. What does bias allow a model to do?
