Chapter 3

Probability & Uncertainty

Why machine learning predictions are probabilities, not guarantees

Why Machines Need Probability

The Coffee Shop Problem

Imagine you run a coffee shop. Every morning, you need to decide: How many blueberry muffins should I bake?

Three Scenarios:

  • Monday: You sold 20 muffins
  • Tuesday: You sold 18 muffins
  • Wednesday: You sold 22 muffins

For Thursday, should you bake exactly 20 muffins?

The Reality of Uncertainty

You can't know the exact number. Thursday could bring 15 customers or 25. Weather changes, events happen, people get sick.

You can't predict the future with certainty. But you CAN estimate probabilities.

The Probability Mindset

Instead of saying "I'll sell exactly 20 muffins," you think:

  • Fewer than 18 muffins (a slow day): 10% chance
  • 18–22 muffins (a typical day): 70% chance
  • More than 22 muffins (a busy day): 20% chance

This is probability: quantifying uncertainty with numbers.
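Once you have a distribution like this, you can already do useful arithmetic with it. A minimal sketch in Python, treating each bucket as a single representative demand value (15, 20, and 25 muffins, an assumption made purely for illustration):

```python
# A minimal sketch: each bucket is collapsed to one representative demand
# value (15, 20, 25 -- an assumption for illustration), weighted by the
# probabilities from the list above.
demand_probs = {15: 0.10, 20: 0.70, 25: 0.20}

expected_demand = sum(demand * p for demand, p in demand_probs.items())
print(f"Expected demand: {expected_demand:.1f} muffins")  # 20.5
```

The expected value (20.5) is a single best guess, but as the rest of this chapter shows, the spread around it matters just as much.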

Machine Learning = Predicting with Uncertainty

In Chapter 2, we saw the spam classifier output "0.95" for an email. What does that mean?

❌ Wrong Interpretation

"This email is DEFINITELY spam"

✅ Correct Interpretation

"I'm 95% confident this is spam"

(There's still a 5% chance it's legitimate)

Key Insight

Every ML prediction is a probability statement about uncertainty.

Even when your model says "99% spam," it's saying "Based on patterns I learned, I believe there's a 99% chance this is spam." It's not omniscient—it's making an educated guess using data.

Probability Fundamentals

What IS Probability?

Probability is a number between 0 and 1 that represents how likely an event is to happen.

  • P = 0: Impossible (a coin landing balanced on its edge, idealized)
  • P = 0.5: Even odds (a fair coin showing heads)
  • P = 1: Certain (the sun rising tomorrow)

The Three Rules of Probability

Rule 1: Probabilities Are Between 0 and 1

0 ≤ P(event) ≤ 1

You can't have -0.3 probability or 1.5 probability. Only values from 0 to 1 make sense.

Rule 2: Something Must Happen

P(all possible outcomes) = 1

Example: When you flip a coin, either heads (H) or tails (T) must happen:

P(H) + P(T) = 0.5 + 0.5 = 1 ✅

Rule 3: Opposite Events Add to 1

P(event happens) + P(event doesn't happen) = 1

Example: If probability of rain is 0.3, then probability of no rain is:

P(no rain) = 1 - 0.3 = 0.7

Interactive: Coin Flip Experiment

Let's see probability in action. Flip a coin many times and watch the proportion of heads approach 0.5.

[Interactive simulator: flip a coin repeatedly and track total flips, number of heads, and the running proportion. Expected long-run proportion: 0.5 (50% heads).]

What You'll Notice

After 10 flips: proportion might be 0.3 or 0.7 (far from 0.5)

After 100 flips: proportion gets closer to 0.5

After 1000 flips: proportion is very close to 0.5

This is called the Law of Large Numbers: As you collect more data, observed frequencies approach true probabilities.
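If you'd rather see the Law of Large Numbers in code than click a button, here's a minimal simulation (plain Python, with a fixed seed so the run is reproducible):

```python
import random

random.seed(42)  # fixed seed so the run is reproducible

flips = 0
heads = 0
for n in [10, 100, 1_000, 100_000]:
    while flips < n:
        heads += random.random() < 0.5  # True counts as 1 head
        flips += 1
    print(f"After {n:>6} flips: proportion of heads = {heads / flips:.3f}")
```

Early proportions wander; by 100,000 flips the proportion sits very close to 0.5.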

A Puzzle: The Monty Hall Problem

You're on a Game Show...

Imagine you're on a game show. There are 3 doors. Behind one door is a car (the prize you want). Behind the other two doors are goats (you don't want goats).

🚪 Door 1   🚪 Door 2   🚪 Door 3

Here's what happens:

  1. You pick Door 1 (you don't open it yet)
  2. The host (Monty Hall) knows what's behind all the doors. He opens Door 3, revealing a goat 🐐
  3. Now only 2 doors remain closed: Door 1 (your choice) and Door 2
  4. Monty asks: "Do you want to STAY with Door 1, or SWITCH to Door 2?"
🚪 Door 1 (your original choice)   🚪 Door 2 (still closed)   🐐 Door 3 (OPENED, goat revealed)

What Would You Do?

🤔 Your First Instinct

"Well, there are now 2 doors left. One has a car, one has a goat. So it's 50/50, right? It doesn't matter if I stay or switch!"

This seems totally logical. After all, there are only 2 options remaining.

⚠️ But Wait...

If you SWITCH to Door 2, you have a 2/3 (66.7%) chance of winning the car!

If you STAY with Door 1, you only have a 1/3 (33.3%) chance of winning!

Wait... what?! How is it NOT 50/50?? 🤯

Here's Where It Gets Interesting

I know what you're thinking: "Two doors, so 50/50." That was my first thought too.

But something changed when Monty opened Door 3. That action gave us new information—and the probabilities shifted.

To understand why, we need to learn about conditional probability—the most important concept in this entire chapter.

What is Conditional Probability?

Probability That Changes with New Information

Conditional probability is just: "What's the probability of something happening, NOW THAT I know something else?"

🌧️ Everyday Example: Will It Rain?

Before You Look Outside

Question: What's the probability it will rain today?

P(rain) = 30%, based only on the weather forecast for your city.

You look outside and see dark clouds!

After You See Dark Clouds

Question: What's the probability it will rain today, GIVEN THAT the sky has dark clouds?

P(rain | dark clouds) = 80%. This is conditional probability!

📝 The Notation: P(A | B)

The vertical bar "|" means "given that" or "knowing that"

P(rain | dark clouds)

"The probability it will rain,
given that I know there are dark clouds"

Key Insight: The "|" separates what you're calculating (left side) from what you already know (right side)

More Everyday Examples

🎓

Getting Hired

Without information:

P(get hired) = 5%

Only 5% of applicants get hired

Knowing you have 10 years experience:

P(hired | 10 yrs exp) = 40%

New information changes the probability!

🚗

Traffic Jam

Without information:

P(traffic jam) = 20%

On a typical day

Knowing it's rush hour:

P(traffic | rush hour) = 75%

Much higher during rush hour!

🎲

Rolling Dice

Without information:

P(roll a 6) = 1/6

Any number is equally likely

Knowing you rolled an even number:

P(6 | even) = 1/3

Only 2, 4, or 6 are possible now!

💡 The Core Idea

Conditional probability is how probabilities update when you learn new information. The new information changes what's possible, so the probabilities change too!
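To make the dice example concrete, here's a tiny sketch that computes P(6 | even) by enumeration: conditioning on "even" shrinks the set of possible outcomes, and the probability is recomputed within that smaller set.

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}               # all faces of a fair die
even = {o for o in outcomes if o % 2 == 0}  # the new information: {2, 4, 6}

# P(6 | even) = |{6} ∩ even| / |even|
p_6_given_even = Fraction(len({6} & even), len(even))
print(p_6_given_even)  # 1/3 -- only 2, 4, or 6 are possible now
```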

Connecting Back: The Monty Hall Problem

Now we can understand why switching works in the Monty Hall problem!

It's All About Conditional Probability

At the Start

You pick Door 1. The car is equally likely behind any door:

P(car behind Door 1) = 1/3
P(car behind Door 2) = 1/3
P(car behind Door 3) = 1/3

Monty opens Door 3 (reveals a goat)
After Monty Opens Door 3

Now we use conditional probability:

P(car at Door 1 | Monty opened Door 3) = 1/3

Your original choice doesn't change

P(car at Door 2 | Monty opened Door 3) = 2/3

Door 2 gets all the probability from Door 3!

🎯 Why Switching Works

Monty's action (opening a door) gives you new information. The conditional probability P(car at Door 2 | Monty opened Door 3) = 2/3 is higher than your original choice (1/3), so switching doubles your chances!

Conditional probability is the foundation. But there are two completely different ways to think about probability itself. Understanding both will reveal why machine learning works the way it does...

Two Ways to Think About Probability

The Monty Hall Problem: Two Approaches

There are TWO different philosophies for what "probability" means. The famous Monty Hall problem shows this difference beautifully.

The Setup:

You're on a game show. There are 3 doors. Behind one is a car 🚗, behind the other two are goats 🐐.

  1. You pick a door (say, Door 1)
  2. The host (who knows where the car is) opens a different door showing a goat (say, Door 3)
  3. Question: Should you switch to Door 2, or stay with Door 1?

Surprisingly: You should ALWAYS switch! But why? Let's see how Frequentist and Bayesian thinkers arrive at this answer...

Frequentist: "Let's Run 1000 Games"

Discover through repetition

Interactive Simulator

A Frequentist says: "I don't speculate. Let me play this game many times and see what happens."

[Interactive simulator: play many games and track win percentages for the Stay and Switch strategies.]

Frequentist Conclusion

"After 1000 games, switching wins ~67% of the time. Therefore, switch!"

The Frequentist discovers the answer through experimentation and observation of long-run frequencies.
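If you'd like to run the experiment yourself, here's a minimal sketch of the simulator (the function name `play` is ours, not from any particular library):

```python
import random

def play(switch: bool) -> bool:
    """Play one game of Monty Hall; return True if you win the car."""
    doors = [0, 1, 2]
    car = random.choice(doors)
    pick = random.choice(doors)
    # Monty opens a door that is neither your pick nor the car.
    monty = random.choice([d for d in doors if d != pick and d != car])
    if switch:
        pick = next(d for d in doors if d != pick and d != monty)
    return pick == car

random.seed(0)
games = 10_000
stay_wins = sum(play(switch=False) for _ in range(games))
switch_wins = sum(play(switch=True) for _ in range(games))
print(f"Stay:   {stay_wins / games:.1%}")    # ~33%
print(f"Switch: {switch_wins / games:.1%}")  # ~67%
```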

✅ Strengths

  • Objective—based on actual outcomes
  • No assumptions needed
  • Perfect for repeatable experiments

⚠️ Limitations

  • Requires many repetitions to be confident
  • Can't analyze a single game logically
  • Doesn't explain why switching works

Bayesian: "Update My Beliefs"

Deduce through logic

Belief Meter Visualization

A Bayesian says: "I start with initial beliefs, then UPDATE them as I get new information."

Stage 1: Initial Pick

You pick Door 1. Each door equally likely has the car:

🚪 Door 1: 33%   🚪 Door 2: 33%   🚪 Door 3: 33%

Host opens Door 3 → 🐐 ⬇️ UPDATE BELIEFS
Stage 2: After Host Opens Door 3

NEW INFORMATION: Host revealed a goat behind Door 3

Key insight: The host knows where the car is and will NEVER open the car door!

🚪 Door 1 (your pick): 33%, unchanged   🚪 Door 2: 67%, DOUBLED!   🐐 Door 3 (eliminated): 0%
The Mathematics:

P(car at Door 2 | Host opens Door 3)
= P(Host opens Door 3 | car at Door 2) × P(car at Door 2) / P(Host opens Door 3)
= (1.0 × 1/3) / (1/2) = 2/3 ≈ 67%

(Why is the denominator 1/2? If the car is behind Door 1, Monty opens Door 2 or Door 3 with equal chance; if it's behind Door 2, he must open Door 3; if it's behind Door 3, he never opens it. So P(Host opens Door 3) = 1/2 × 1/3 + 1 × 1/3 + 0 × 1/3 = 1/2.)

When you picked Door 1, there was a 2/3 chance the car was behind Door 2 OR Door 3. The host just eliminated Door 3, so all that probability flows to Door 2!
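The same update can be done as exact arithmetic. A small sketch, with the likelihoods encoding Monty's behavior (he never opens your door or the car door):

```python
from fractions import Fraction

prior = Fraction(1, 3)        # P(car at each door) before any evidence
likelihood = {                # P(Monty opens Door 3 | car location)
    "door1": Fraction(1, 2),  # he picks Door 2 or 3 at random
    "door2": Fraction(1),     # he is forced to open Door 3
    "door3": Fraction(0),     # he never reveals the car
}

# Total probability Monty opens Door 3 (priors are equal, so factor one out):
p_open3 = sum(likelihood.values()) * prior             # = 1/2
posterior_door2 = likelihood["door2"] * prior / p_open3
print(posterior_door2)  # 2/3
```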

Bayesian Conclusion

"I'm 67% confident the car is behind Door 2. Switch!"

The Bayesian deduces the answer through logical reasoning and belief updating from a single game.

✅ Strengths

  • Works from a single observation
  • Explains why something is true
  • Incorporates logic and prior knowledge

⚠️ Limitations

  • Requires understanding the problem structure
  • Depends on choosing correct priors
  • More complex mathematically

Making It Intuitive: 100 Doors

Still not convinced switching helps? Let's scale up the problem to make the logic crystal clear.

Imagine: 100 Doors, 1 Car, 99 Goats

Step 1: You pick Door #1

Probability you picked correctly: 1/100 = 1%

Probability the car is behind one of the other 99 doors: 99/100 = 99%

Step 2: The host opens 98 doors—ALL showing goats!

Only your Door #1 and Door #47 remain closed.

The host knows where the car is. He deliberately left Door #47 closed.

Step 3: Should you switch to Door #47?

Stay with Door #1: 1% (your initial random guess)

VS

Switch to Door #47: 99% (all the other 99 doors' probability!)

OBVIOUSLY you should switch!

Why This Makes It Clear

With 100 doors, it's intuitive that:

  • Your initial random guess had only 1% chance of being right
  • The car had 99% chance of being behind one of the other doors
  • The host didn't leave Door #47 closed by accident—he left it because it might have the car!
  • All that 99% probability is now concentrated on Door #47

The same logic applies to 3 doors, but with 100 doors, our intuition finally catches up!

Frequentist vs Bayesian: Summary

| Aspect | Frequentist | Bayesian |
|---|---|---|
| Probability means... | Long-run frequency in repeated trials | Degree of belief given available information |
| How they solve Monty Hall... | Simulate 1000 games, observe switching wins ~67% | Update beliefs from 33%/33%/33% to 33%/67%/0% |
| Answer comes from... | Empirical observation | Logical deduction |
| Best for... | Repeatable experiments, quality control, A/B testing | One-time decisions, incorporating prior knowledge, sequential learning |
| In Machine Learning... | Training models on large datasets | Online learning, spam filters, recommendation systems |

Which Should You Use?

Both! They complement each other:

  • Frequentist thinking: Great when you have lots of data and can repeat experiments. Used for hypothesis testing, confidence intervals, and classical ML training.
  • Bayesian thinking: Essential when you have prior knowledge or can't repeat the experiment. Powers spam filters, medical diagnosis, and adaptive systems.

Modern AI systems use both approaches. Neural networks are trained with frequentist methods but make predictions that are interpreted as Bayesian probabilities!

Why We Naturally Think Like Frequentists

🧠 Optional: The Psychology Behind Our Thinking

Here's Something Interesting About How We Think

Cognitive psychology research has discovered something fascinating: we naturally tend to think like frequentists. When making decisions, we often ignore prior probabilities (base rates) and focus only on the immediate evidence in front of us.

The Base Rate Fallacy

When we make decisions, we often commit what psychologists call the "base rate fallacy" — ignoring general probability information (priors) in favor of specific case information.

📋 Classic Example: The Cab Problem

The Facts:

  • 85% of cabs in the city are Green
  • 15% of cabs in the city are Blue
  • A witness says they saw a Blue cab at the scene of a hit-and-run
  • The witness is 80% reliable (correct 80% of the time)

Question: What's the probability it was actually a Blue cab?

❌ Your First Instinct: "About 80%"

It's natural to focus on the witness reliability (80%) and overlook the base rate (only 15% of cabs are Blue). This is frequentist thinking — trusting only the observed data.

✅ The Bayesian Answer: About 41%

Using Bayes' Theorem and considering BOTH the witness reliability AND the base rate:

Scenario 1: Cab is Blue (15% base rate) → Witness says Blue (80% reliable) = 0.15 × 0.80 = 0.12

Scenario 2: Cab is Green (85% base rate) → Witness says Blue (20% wrong) = 0.85 × 0.20 = 0.17

P(Blue | witness says Blue) = 0.12 / (0.12 + 0.17) ≈ 41%

The base rate matters! Most cabs are Green, so even with witness testimony, there's still a good chance it was a misidentified Green cab.
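The whole calculation is a few lines of arithmetic; here it is as a sketch you can rerun with other numbers:

```python
p_blue, p_green = 0.15, 0.85  # base rates for cab color
p_says_blue_if_blue = 0.80    # witness correct
p_says_blue_if_green = 0.20   # witness mistaken

numerator = p_says_blue_if_blue * p_blue                # 0.12
evidence = numerator + p_says_blue_if_green * p_green   # 0.12 + 0.17 = 0.29
print(f"P(Blue | witness says Blue) = {numerator / evidence:.0%}")  # ~41%
```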

💊 Another Example: The Medical Test

Here's a medical scenario that shows the same pattern:

The Setup:

  • Disease affects 0.1% of people (base rate/prior)
  • Test is 99% accurate
  • You test positive
❌ Frequentist Intuition (Ignoring Prior):

"The test is 99% accurate, so I probably have it."

Error: Ignoring the 0.1% base rate!

✅ Bayesian Reasoning (Using Prior):

"Disease is rare (0.1%). Even with a positive test, I only have ~9% chance."

Considers BOTH the test accuracy AND the base rate.

💡 Why This Happens

Our brains are wired to focus on concrete, immediate evidence (the test result, the witness) rather than abstract statistical information (base rates). This is frequentist thinking — let the data "speak for itself" without considering prior probabilities.

MLE vs MAP: The Mathematical Reflection

This frequentist vs Bayesian intuition shows up in machine learning estimation methods.

MLE: Maximum Likelihood Estimation

Frequentist Approach
Find parameters θ that maximize: P(data | θ)

Translation: "What parameter values make this observed data most likely?"

  • Ignores priors: Only looks at the data you have
  • Pure data-driven: Let the observations speak for themselves
  • Works well with lots of data: When you have 10,000 examples, priors matter less
Example: Coin Flips

You flip a coin 10 times: 7 heads, 3 tails.

MLE estimate: P(heads) = 7/10 = 70%

Just counts the data. Doesn't use prior knowledge that coins are usually fair.

MAP: Maximum A Posteriori Estimation

Bayesian Approach
Find parameters θ that maximize: P(θ | data) ∝ P(data | θ) × P(θ)

Translation: "What parameter values are most likely, given BOTH the data AND my prior beliefs?"

  • Includes priors: P(θ) represents prior knowledge
  • Balances evidence and belief: Combines data with domain knowledge
  • Better with small data: Prior knowledge helps when you have limited observations
Example: Coin Flips

You flip a coin 10 times: 7 heads, 3 tails.

MAP estimate (with prior that coins are usually fair): P(heads) ≈ 55-60%

Balances the observed 70% with prior belief that coins are typically 50/50. Result is pulled toward fairness.

🔗 The Connection

MLE is a special case of MAP where the prior P(θ) is uniform (all values equally likely). When you assume no prior knowledge, MAP reduces to MLE.

MAP with uniform prior = MLE
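Here's a small sketch comparing the two estimates for the coin-flip example. The Beta(10, 10) prior is our assumption (the chapter only says coins are "usually fair"); it encodes a belief that P(heads) is probably near 0.5.

```python
heads, tails = 7, 3
n = heads + tails

# MLE: just count the data.
mle = heads / n

# MAP with a Beta(a, b) prior; the posterior mode is
# (heads + a - 1) / (n + a + b - 2), which pulls the estimate toward 0.5.
a, b = 10, 10  # assumed prior strength
map_estimate = (heads + a - 1) / (n + a + b - 2)

print(f"MLE: {mle:.0%}")           # 70%
print(f"MAP: {map_estimate:.0%}")  # ~57%, pulled toward fairness
# With a uniform prior (a = b = 1), the MAP formula reduces exactly to MLE.
```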

When Each Approach Makes Sense

Use Frequentist/MLE When:

  • You have lots of data: With 1 million examples, the data overwhelms any prior
  • You have no domain knowledge: If you truly don't know what to expect beforehand
  • You want simplicity: MLE is computationally simpler
Example: Training a neural network on ImageNet (14M images)

Use Bayesian/MAP When:

  • You have limited data: With only 10 coin flips, prior knowledge helps
  • You have domain expertise: Medical diagnosis benefits from known disease prevalence
  • You need uncertainty quantification: Bayesian methods naturally provide confidence intervals
Example: Rare disease diagnosis (few cases, but known prevalence)

🧠 The Lesson for You

Your intuition is frequentist — you naturally focus on immediate evidence and ignore base rates. But Bayesian thinking (considering priors) often gives better answers, especially with limited data. This is why the medical test problem feels so counterintuitive: we have to FIGHT our frequentist intuition to properly incorporate the base rate!

Bayes' Theorem in Action: Medical Testing

The Medical Test Problem

You've tested positive for a rare disease. The test is 99% accurate. Should you panic?

The Setup

  • The disease affects 0.1% of the population (1 in 1,000 people)
  • The test is 99% accurate — it correctly identifies 99% of sick people and 99% of healthy people
  • You just tested POSITIVE

What's the probability you actually have the disease?

Understanding Bayesian Terminology

Prior Probability:

Your initial belief before seeing any evidence. In this case: P(disease) = 0.1% (the base rate in the population)

Posterior Probability:

Your updated belief after seeing evidence. What we're trying to find: P(disease | positive test)

Likelihood:

How likely the evidence is if your hypothesis is true. In this case: P(positive test | disease) = 99%

🔑 Key Insight: Bayes' Theorem tells us how to update our Prior using the Likelihood to get the Posterior

Prior + New Evidence (Likelihood) = Posterior

Interactive Bayesian Calculator

[Interactive calculator: adjust how common the disease is in the population and how accurate the test is, then watch the prior update to the posterior.]

Prior Probability: 0.1% (chance of disease before the test, based on population prevalence)

New Evidence: Test Result = POSITIVE ✓

Posterior Probability: 9.0% (chance of disease after the positive test, updated with the evidence!)

Bayes' Theorem Calculation:

P(disease | positive) = [P(positive | disease) × P(disease)] / P(positive)

The Surprising Truth

With default values (0.1% prevalence, 99% accuracy), even after testing positive, you only have about 9% chance of actually having the disease!

Why? Because the disease is so rare:

  • True positives: Out of 1000 people, 1 has the disease and tests positive (99% of the time)
  • False positives: Out of the 999 healthy people, about 10 test positive by mistake (1% false positive rate)
  • Result: You're one of ~11 people who tested positive, but only ~1 actually has the disease = 9% probability!

This is why doctors often order multiple tests—each positive result updates the probability higher!

Try It Yourself

Experiment 1: Increase disease prevalence to 10%. Notice how the posterior jumps to 91%!

Experiment 2: Lower test accuracy to 90%. See how false positives increase.

Experiment 3: Set prevalence to 50% (coin flip). What happens to the posterior?
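A minimal stand-in for the interactive calculator, so you can run the three experiments yourself (it assumes, as the setup above does, that the test is equally accurate on sick and healthy people):

```python
def posterior(prevalence: float, accuracy: float) -> float:
    """P(disease | positive test) via Bayes' theorem."""
    true_pos = accuracy * prevalence               # sick AND test positive
    false_pos = (1 - accuracy) * (1 - prevalence)  # healthy AND test positive
    return true_pos / (true_pos + false_pos)

print(f"{posterior(0.001, 0.99):.1%}")  # ~9.0%  (default values)
print(f"{posterior(0.10,  0.99):.1%}")  # ~91.7% (Experiment 1)
print(f"{posterior(0.001, 0.90):.1%}")  # ~0.9%  (Experiment 2)
print(f"{posterior(0.50,  0.99):.1%}")  # 99.0%  (Experiment 3)
```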

Medical testing is life-and-death. But Bayesian thinking and conditional probability have solved mysteries throughout history. Let's look at one of the most famous examples...

Case Study: The Federalist Papers Mystery

📜 Optional: Historical Case Study

How statisticians used conditional probability to solve a 175-year authorship mystery about the Federalist Papers, proving James Madison wrote the disputed essays with odds of 160 billion to 1.

A 175-Year Mystery Solved by Probability

In 1787-1788, three founding fathers—Alexander Hamilton, James Madison, and John Jay—wrote 85 essays to convince Americans to ratify the new U.S. Constitution. They published all essays anonymously under the pen name "Publius".

📜 The Authorship Dispute

After publication, the authorship of most essays was clear:

  • 5 essays were definitely by John Jay
  • 51 essays were definitely by Alexander Hamilton
  • 17 essays were definitely by James Madison
  • 12 essays were DISPUTED — both Hamilton and Madison claimed authorship!

The disputed papers: Numbers 49-58, 62, and 63

For 175 years, historians debated who wrote these 12 essays. Hamilton died in a duel with Aaron Burr in 1804 without clarifying. Madison died in 1836, also leaving the question unresolved.

How Statisticians Solved It (1964)

Statisticians Frederick Mosteller (Harvard) and David Wallace (University of Chicago) used conditional probability and Bayes' Theorem to solve the mystery.

The Method: Word Frequency Analysis

They analyzed essays where authorship was certain and found discriminating words—small filler words that authors use unconsciously:

Hamilton's Pattern: "upon" 3.24 per 1,000 words; "by" ~4 per 1,000 words

Madison's Pattern: "upon" 0.3 per 1,000 words; "by" ~8 per 1,000 words (2× Hamilton!)

🔑 Key Insight

Hamilton loved "upon" (used it 10× more than Madison). Madison loved "by" (used it 2× more than Hamilton). These aren't conscious choices—they're unconscious writing habits!

Applying Conditional Probability

For each disputed essay, they calculated:

P(Madison | word pattern) =
[P(word pattern | Madison) × P(Madison)] / P(word pattern)

They analyzed about 30 discriminating words. Each word provides independent evidence that updates the probability.
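To give a flavor of the method, here's a toy sketch using only the two word rates quoted above (the real study used roughly 30 words and more sophisticated statistical models). Word counts in an essay are modeled as Poisson with each author's rate, and the essay itself is hypothetical.

```python
import math

RATES = {  # occurrences per 1,000 words, from the chapter
    "Hamilton": {"upon": 3.24, "by": 4.0},
    "Madison":  {"upon": 0.3,  "by": 8.0},
}

def log_likelihood(author: str, counts: dict, words: int) -> float:
    """log P(observed word counts | author) under a Poisson model."""
    total = 0.0
    for word, count in counts.items():
        expected = RATES[author][word] * words / 1000
        total += count * math.log(expected) - expected - math.lgamma(count + 1)
    return total

# Hypothetical disputed essay: 2,000 words, "upon" once, "by" 15 times.
counts, words = {"upon": 1, "by": 15}, 2000
log_ratio = (log_likelihood("Madison", counts, words)
             - log_likelihood("Hamilton", counts, words))
print(f"Evidence favors Madison by a factor of ~{math.exp(log_ratio):,.0f}")
```

Each additional discriminating word multiplies the odds, which is how about 30 words can push the total into the billions to one.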

The Verdict: Mathematics Solves History

📊 The Statistical Conclusion

160,000,000,000 : 1

Odds favoring Madison for some essays

That's 160 BILLION to 1!

12,600,000 : 1

Odds for other disputed essays

Still overwhelmingly favoring Madison

✓ SOLVED

All 12 disputed Federalist Papers (49-58, 62-63) were written by James Madison.

After 175 years of historical debate, conditional probability gave us certainty.

Predictions Aren't Perfect: Understanding Uncertainty

Your Model Says $450k... But Will It Sell for Exactly $450,000?

Back in Chapters 1-2, we learned: Price = $100k + $50k × bedrooms + $100k × bathrooms

Example: 3-Bedroom, 2-Bathroom House

Price = $100k + $50k × 3 + $100k × 2 = $450k

The model predicts: $450,000

❓ The Question

Will this house sell for exactly $450,000.00?

❌ No! Here's why:

Even with the same bedrooms and bathrooms, different houses sell for different prices because of factors we didn't measure:

  • Neighborhood quality and school district
  • House condition and recent renovations
  • Yard size and landscaping
  • Market timing (buyer demand that week)
  • Negotiation skills
✅ What the model ACTUALLY tells us:

"Houses with 3 bedrooms and 2 bathrooms sell for $450k on average"

But individual houses vary around this average.

Visualizing the Uncertainty

Let's say we look at 100 houses with 3 bedrooms and 2 bathrooms. Here's what we might observe:

[Bell-shaped distribution of sale prices centered at the $450k mean; most houses fall between $420k and $480k.]

Understanding the Distribution

Mean = $450k

The model's prediction. The center of the distribution. Most houses cluster around this value.

Standard Deviation ≈ $30k

Measures the "typical" spread. About 68% of houses sell within ±$30k of the mean ($420k-$480k).

95% Range ≈ $390k-$510k

About 95% of houses with these features sell within this range (mean ± 2 standard deviations).

📊 Key Insight: Mean vs. Individual Predictions

The model predicts the average (mean) price. Individual houses vary around this average due to unmeasured factors. This variation forms a distribution (typically bell-shaped/normal).

What This Means for Machine Learning

🎯 Point Prediction

$450k

This is what simple models give you: a single number.

Problem: Doesn't tell you how confident the prediction is!

📈 Prediction with Uncertainty

$450k ±$30k

Modern ML systems provide both prediction AND uncertainty.

✓ Better: "I predict $450k, typically ±$30k"

🎲 Probability Distribution

Advanced models output entire probability distributions.

✓ Best: "Here's the full range of likely prices"

Real-World ML Applications

🏠
Zillow "Zestimate"

Zillow doesn't just predict "$450k"—they show a range: "$420k - $480k" with a confidence level. This is prediction uncertainty in action!

🌡️
Weather Forecasts

"High of 75°F, but could range from 72-78°F" — they're giving you the mean and the uncertainty range.

🤖
Modern LLMs

When ChatGPT or Claude say "I'm fairly confident..." or "I'm not entirely sure...", they're communicating uncertainty based on token probability distributions.

📐 Optional: Technical Terms & Mathematical Formalism

Formal definitions of mean (μ), standard deviation (σ), variance (σ²), and the normal distribution: the mathematical terminology behind prediction uncertainty.

The Technical Terms (Optional Deep Dive)

If you want to understand the formal terminology:

Mean (μ)

The average value. In our example: $450k.

μ = Σ(all prices) / n
Standard Deviation (σ)

How spread out values are from the mean. In our example: ≈$30k.

Smaller σ = tighter predictions. Larger σ = more uncertainty.

Variance (σ²)

Square of standard deviation. Less intuitive but mathematically useful.

In our example: ($30k)² = 900,000,000 (in squared dollars, which is why σ is easier to interpret)

Normal Distribution

The bell-shaped curve. Also called Gaussian distribution.

Many natural phenomena (including ML prediction errors) follow this pattern.

🔬 Why This Matters in ML

In linear regression, we assume the prediction errors (the difference between predicted and actual values) follow a normal distribution with:

  • Mean = 0: On average, we're not systematically over or under predicting
  • Standard deviation = σ: Tells us typical error size (e.g., ±$30k)

This lets us say: "95% confident the actual price will be within $390k-$510k"
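A quick check of those ranges, using Python's built-in NormalDist with the chapter's assumed mean ($450k) and standard deviation ($30k):

```python
from statistics import NormalDist

prices = NormalDist(mu=450_000, sigma=30_000)

# ~68% of sales within one standard deviation ($420k-$480k):
print(f"{prices.cdf(480_000) - prices.cdf(420_000):.0%}")  # 68%

# ~95% within two standard deviations ($390k-$510k):
print(f"{prices.cdf(510_000) - prices.cdf(390_000):.0%}")  # 95%
```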

Connecting Everything: From Probability to Uncertainty

1. Probability Fundamentals

Coin flips, Monty Hall, conditional probability — understanding uncertainty in discrete events

2. Distributions

When outcomes are continuous (like house prices), we use probability distributions to model uncertainty

3. ML Predictions

Modern ML doesn't just predict values — it predicts distributions, giving you both the answer AND the confidence

🎯 The Big Picture

Whether it's flipping coins, predicting house prices, or getting answers from ChatGPT — probability and uncertainty are fundamental. Good ML systems don't just give you answers; they tell you how confident they are in those answers.

We've seen how probability shapes predictions in machine learning—from coin flips to house prices. Now let's explore the most sophisticated probability machines ever built: Large Language Models. Every word you read from ChatGPT, Claude, or Gemini is the result of probability distributions over tens of thousands of possible words.

How Modern LLMs Use Probability

LLMs Are Probability Machines

Every word you see from ChatGPT, Claude, or Gemini is the result of sampling from a probability distribution. Let's understand what that really means.

What Happens When You Type a Prompt

  1. You type: "The capital of France is"
  2. The LLM computes: A probability distribution over ALL possible next words (~50,000 options)
  3. Example distribution:
"Paris"
95%
"located"
2%
"actually"
1%
"Lyon"
0.5%
...49,996 other words 1.5%

Step 4: The model samples a word based on these probabilities. Usually "Paris" (95% likely), but occasionally something else!

Key Insight

LLMs don't "know" facts. They predict probable next tokens.

When Claude says "Paris," it's not retrieving a stored fact—it's predicting the most probable token given billions of training examples where "capital of France is" preceded "Paris."

Temperature: Controlling Randomness

You can control how "creative" or "conservative" an LLM is by adjusting temperature—a parameter that reshapes the probability distribution.

Temperature = 0 (Deterministic)

Always picks the MOST likely token

✅ Use for: Factual Q&A, code generation, translations

Temperature = 1 (Balanced)

Samples proportionally to original probabilities

⚖️ Use for: General chat, balanced responses

Temperature = 2 (Creative)

Flattens distribution—more surprises!

🎨 Use for: Creative writing, brainstorming, poetry

Adjusted Probability = softmax(logits / temperature)
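Here's a minimal sketch of that formula. The logits (raw scores) are made up for illustration; a real model produces one logit per token in its vocabulary.

```python
import math

def softmax_with_temperature(logits, temperature):
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = {"Paris": 5.0, "located": 1.2, "actually": 0.5}  # hypothetical scores
for t in [0.5, 1.0, 2.0]:
    probs = softmax_with_temperature(list(logits.values()), t)
    print(t, {word: f"{p:.1%}" for word, p in zip(logits, probs)})
# Low temperature sharpens the distribution toward "Paris";
# high temperature flattens it, giving rarer words a chance.
```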

Real-World Example

Prompt: "Write a story opening: 'Once upon a time'"

  • Temperature = 0: "Once upon a time, there was a kingdom..." (predictable)
  • Temperature = 1: "Once upon a time, a young girl discovered..." (balanced)
  • Temperature = 2: "Once upon a time, the universe whispered..." (creative)

Top-K and Top-P Sampling

Besides temperature, LLMs use sampling strategies to decide which tokens are even considered.

Top-K Sampling

Strategy: Only consider the K most likely tokens

K = 3: Only consider "Paris" (95%), "located" (2%), "actually" (1%)

Ignores all 49,997 other tokens, even if they sum to 2%

🎯 Prevents completely nonsensical outputs

Top-P Sampling (Nucleus Sampling)

Strategy: Consider smallest set of tokens that sum to probability P

P = 0.95: Include tokens until cumulative probability ≥ 95%

Adapts to context—sometimes considers 3 tokens, sometimes 50!

🧠 More flexible than Top-K, used by most modern LLMs

Why This Matters

These techniques balance quality (staying probable) with diversity (avoiding repetition). Without them, LLMs either:

  • Repeat the same phrase forever (if always choosing max probability)
  • Generate gibberish (if sampling uniformly from all 50,000 tokens)
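Both strategies are only a few lines each. A sketch, using the illustrative probabilities from earlier in this section:

```python
def top_k(dist, k):
    """Keep only the k most likely tokens, then renormalize."""
    kept = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

def top_p(dist, threshold):
    """Keep the smallest set of tokens whose cumulative probability
    reaches the threshold, then renormalize."""
    kept, cumulative = {}, 0.0
    for tok, p in sorted(dist.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= threshold:
            break
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

dist = {"Paris": 0.95, "located": 0.02, "actually": 0.01, "Lyon": 0.005}
print(top_k(dist, 3))     # drops "Lyon" no matter what
print(top_p(dist, 0.95))  # "Paris" alone reaches 95%, so only it survives
```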

So far, we've seen how LLMs sample individual words from probability distributions. But what about complex tasks—like solving math problems or writing code? This brings us to one of the most talked-about topics in AI: "reasoning."

The Truth About AI "Reasoning"

You've probably heard about "reasoning models" like OpenAI's o1 or "chain-of-thought prompting." Let's demystify what "reasoning" actually means in LLMs—spoiler: it's still all about probability.

⚠️ Important: LLMs Don't "Think"

Despite the terminology, LLMs don't have internal mental states, beliefs, or logic. They:

  • Don't maintain symbolic representations of facts
  • Don't perform logical deduction like a theorem prover
  • Don't "understand" in the way humans do

What they DO: Generate sequences of tokens where each token is predicted based on probability distributions learned from training data.

Chain-of-Thought: Reasoning as a Probability Path

"Reasoning" in LLMs means generating intermediate steps that increase the probability of reaching correct final answers.

❌ Direct Answer (Lower Accuracy)

Prompt: "What is 347 × 29?"

"10,063"

Accuracy: ~60%

The model must produce the final answer directly, with no intermediate steps, sampling from a distribution where the correct answer might only have 60% probability.

✅ Chain-of-Thought (Higher Accuracy)

Prompt: "What is 347 × 29? Let's think step by step."

"First, 347 × 20 = 6,940
Then, 347 × 9 = 3,123
Adding: 6,940 + 3,123 = 10,063"

Accuracy: ~90%

Each intermediate step has high probability (easier to predict). The path through probability space leads to the correct answer more reliably.

Why Chain-of-Thought Works

It's not magic—it's probability geometry.

By generating intermediate tokens, the model:

  • Narrows the probability distribution at each step
  • Uses each predicted token as context for the next prediction
  • Follows a path through high-probability regions of "token space"

Think of it like the Monty Hall problem: showing intermediate steps updates the probabilities, just like the host revealing a goat!

OpenAI o1 and "Reasoning Models"

Models like OpenAI's o1 take chain-of-thought to the extreme: they generate thousands of internal reasoning tokens before outputting an answer.

How "Reasoning Models" Work

Step 1: User asks a hard question

"Prove that √2 is irrational"

Step 2: Model generates hidden "thinking" tokens

(You don't see these—they're internal to the model)

"Hmm, this is a proof by contradiction..."

"Assume √2 = p/q in lowest terms..."

"Then 2q² = p², so p² is even..."

"Wait, that means p is even too..."

"So p = 2k for some k..."

"Substituting: 2q² = (2k)² = 4k²..."

"Therefore q² = 2k², so q is also even..."

"But that contradicts p/q being in lowest terms!"

Model generated ~500 reasoning tokens internally

Step 3: Model outputs clean answer

"By contradiction: Assume √2 = p/q. Then 2q² = p², implying both p and q are even, contradicting lowest terms. Therefore √2 is irrational."

Traditional LLMs (GPT-4, Claude 3)

  • Generate ~50-100 tokens for reasoning
  • User sees all thinking
  • Good for most tasks

Reasoning Models (o1)

  • Generate ~1,000-10,000 hidden reasoning tokens
  • User only sees final answer
  • Excels at math, coding, complex reasoning

The Breakthrough

More reasoning tokens = exploring more probability paths = higher chance of correct answer

It's like the Frequentist Monty Hall simulator: running more trials gives you better estimates. Here, generating more intermediate tokens lets the model "search" through probability space more thoroughly.

Limitations: What LLMs Still Can't Do

Even with "reasoning," LLMs have fundamental limits because they're statistical pattern matchers, not logical engines.

❌ True Logical Deduction

An LLM might generate the text "Therefore A implies B" even if A doesn't actually imply B—it's predicting plausible text, not performing symbolic logic.

❌ Guaranteed Correctness

Even o1 can make errors on math problems. It's just much less likely to, because it explores more probability paths. But probability ≠ certainty.

❌ Novel Reasoning

LLMs can only generate reasoning patterns seen in training data. They can't invent genuinely new proof techniques or logical frameworks.

❌ Transparent Reasoning

With models like o1, you can't audit the reasoning—it's hidden. You're trusting a probability distribution, not verifiable logic.

The Bayesian View

Think of LLM "reasoning" through a Bayesian lens:

  • Prior: Initial probability distribution over possible answers
  • Evidence: Each reasoning token the model generates
  • Posterior: Updated probability after considering reasoning steps

Chain-of-thought is essentially Bayesian updating in token space—each step refines the distribution until a high-confidence answer emerges.

Chapter Summary

Key Takeaways

1. Probability Quantifies Uncertainty

ML predictions aren't guarantees—they're probabilistic statements about what's likely, given the data.

2. Two Philosophies: Frequentist vs Bayesian

Frequentist: Long-run frequency (repeat experiments)

Bayesian: Degree of belief (update with evidence)

Both are valid, both are used in ML.

3. Sigmoid & Softmax Convert to Probabilities

Sigmoid: Squashes outputs to [0,1] for binary classification

Softmax: Creates probability distribution over multiple classes

4. Cross-Entropy Measures "Surprise"

Loss = -log(p) penalizes confident wrong predictions heavily

This is why we use logarithms in loss functions!

5. LLMs are Probability Machines

From training on billions of examples to generating text token-by-token, probability is the foundation of how modern AI works.

Mathematical Cheat Sheet

Probability Axioms
0 ≤ P(A) ≤ 1
P(all outcomes) = 1
P(not A) = 1 - P(A)
Bayes' Theorem
P(A|B) = P(B|A) × P(A) / P(B)

Update beliefs with new evidence

Sigmoid Function
σ(z) = 1 / (1 + e^(-z))

Converts any number to probability [0,1]

Softmax Function
softmax(z_i) = e^(z_i) / Σ_j e^(z_j)

Probability distribution over K classes

Cross-Entropy Loss
L = -Σ_i y_i log(p_i)

Measures surprise / prediction quality

Test Your Understanding

Test what you've learned in this chapter!

1. In the Monty Hall problem with 3 doors, what is the probability of winning if you ALWAYS switch after the host opens a door?

2. What is the main difference between Frequentist and Bayesian probability?

3. When an LLM uses Temperature = 0, what does it do?

4. What is "chain-of-thought" reasoning in LLMs?

Looking Ahead: Probability in Advanced Chapters

How Probability Appears in Future Chapters

Ch 8: Non-Linearity

Dropout: Randomly "turning off" neurons during training

→ Uses Bernoulli probability (coin flip for each neuron)

Ch 9: Attention Mechanisms

Softmax: Creates probability distribution over words

→ "Which words should I focus on?" = weighted by probabilities

Ch 10: Modern LLM Architectures

Mixture of Experts: Routing tokens to different expert models

→ Router outputs probabilities: "Send 70% of this token to Expert 1"

Ch 13: RAG Systems

Document Ranking: Which documents are most relevant?

→ Cosine similarity scores interpreted as probabilities

You're Now Prepared!

With this probability foundation, concepts like softmax, dropout, and uncertainty estimation will make intuitive sense when you encounter them.