Why machine learning predictions are probabilities, not guarantees
Imagine you run a coffee shop. Every morning, you need to decide: How many blueberry muffins should I bake?
For Thursday, should you bake exactly 20 muffins?
You can't know the exact number. Thursday could bring 15 customers or 25. Weather changes, events happen, people get sick.
You can't predict the future with certainty. But you CAN estimate probabilities.
Instead of saying "I'll sell exactly 20 muffins," you think in likelihoods: "I'll most likely sell around 20, probably somewhere between 15 and 25."
This is probability: quantifying uncertainty with numbers.
In Chapter 2, we saw the spam classifier output "0.95" for an email. What does that mean?
"This email is DEFINITELY spam"
"I'm 95% confident this is spam"
(There's still a 5% chance it's legitimate)
Every ML prediction is a probability statement about uncertainty.
Even when your model says "99% spam," it's saying "Based on patterns I learned, I believe there's a 99% chance this is spam." It's not omniscient—it's making an educated guess using data.
Probability is a number between 0 and 1 that represents how likely an event is to happen.
You can't have -0.3 probability or 1.5 probability. Only values from 0 to 1 make sense.
Example: When you flip a coin, either heads (H) or tails (T) must happen: P(H) + P(T) = 1. For a fair coin, P(H) = P(T) = 0.5.
Example: If the probability of rain is 0.3, then the probability of no rain is: 1 − 0.3 = 0.7.
Let's see probability in action. Flip a coin many times and watch the proportion of heads approach 0.5.
After 10 flips: proportion might be 0.3 or 0.7 (far from 0.5)
After 100 flips: proportion gets closer to 0.5
After 1000 flips: proportion is very close to 0.5
This is called the Law of Large Numbers: As you collect more data, observed frequencies approach true probabilities.
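Here's a minimal simulation of this idea (a sketch; your exact proportions will vary from run to run):

```python
import random

# Flip a fair coin n times and track the proportion of heads.
# As n grows, the proportion converges toward the true probability, 0.5.
random.seed(42)

for n in [10, 100, 1000, 100_000]:
    heads = sum(random.random() < 0.5 for _ in range(n))
    print(f"After {n:>7,} flips: proportion of heads = {heads / n:.3f}")
```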
Imagine you're on a game show. There are 3 doors. Behind one door is a car (the prize you want). Behind the other two doors are goats (you don't want goats).
Suppose you pick Door 1. The host, who knows where the car is, opens Door 3 and reveals a goat. Now the board looks like this:
- Door 1: your original choice
- Door 2: still closed
- Door 3: OPENED, goat revealed
The host asks: "Do you want to stay with Door 1, or switch to Door 2?"
"Well, there are now 2 doors left. One has a car, one has a goat. So it's 50/50, right? It doesn't matter if I stay or switch!"
This seems totally logical. After all, there are only 2 options remaining.
But here's the surprising truth: if you SWITCH to Door 2, you have a 2/3 (66.7%) chance of winning the car!
If you STAY with Door 1, you only have a 1/3 (33.3%) chance of winning!
Wait... what?! How is it NOT 50/50?? 🤯
I know what you're thinking: "Two doors, so 50/50." That was my first thought too.
But something changed when Monty opened Door 3. That action gave us new information—and the probabilities shifted.
To understand why, we need to learn about conditional probability—the most important concept in this entire chapter.
Conditional probability is just: "What's the probability of something happening, NOW THAT I know something else?"
Question: What's the probability it will rain today?
P(rain)
Just based on weather forecast for your city
You look outside and see dark clouds!
Question: What's the probability it will rain today, GIVEN THAT the sky has dark clouds?
P(rain | dark clouds)
This is conditional probability! The vertical bar "|" means "given that" or "knowing that". Read aloud: "the probability it will rain, given that I know there are dark clouds."
Key Insight: The "|" separates what you're calculating (left side) from what you already know (right side)
- Hiring: without information, P(get hired) = 5% (only 5% of applicants get hired). Knowing you have 10 years of experience: P(hired | 10 yrs exp) = 40%. New information changes the probability!
- Traffic: without information, P(traffic jam) = 20% on a typical day. Knowing it's rush hour: P(traffic | rush hour) = 75%. Much higher during rush hour!
- Dice: without information, P(roll a 6) = 1/6 (any number is equally likely). Knowing you rolled an even number: P(6 | even) = 1/3 (only 2, 4, or 6 are possible now!).
Conditional probability is how probabilities update when you learn new information. The new information changes what's possible, so the probabilities change too!
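You can check the dice example yourself with a quick simulation (a sketch; the estimates will wobble around 1/6 and 1/3):

```python
import random

# Estimate P(roll a 6) and P(6 | even) by simulation.
random.seed(0)
rolls = [random.randint(1, 6) for _ in range(100_000)]

p_six = sum(r == 6 for r in rolls) / len(rolls)

evens = [r for r in rolls if r % 2 == 0]  # condition on "rolled an even number"
p_six_given_even = sum(r == 6 for r in evens) / len(evens)

print(f"P(6)        ≈ {p_six:.3f}")             # ≈ 0.167 = 1/6
print(f"P(6 | even) ≈ {p_six_given_even:.3f}")  # ≈ 0.333 = 1/3
```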
Now we can understand why switching works in the Monty Hall problem!
You pick Door 1. The car is equally likely behind any door: P(Door 1) = P(Door 2) = P(Door 3) = 1/3.
Now we use conditional probability. Monty, who will never reveal the car, opens Door 3 and shows a goat:
- P(car at Door 1 | Monty opened Door 3) = 1/3: your original choice doesn't change
- P(car at Door 2 | Monty opened Door 3) = 2/3: Door 2 gets all the probability from Door 3!
Monty's action (opening a door) gives you new information. The conditional probability P(car at Door 2 | Monty opened Door 3) = 2/3 is higher than your original choice (1/3), so switching doubles your chances!
Conditional probability is the foundation. But there are two completely different ways to think about probability itself. Understanding both will reveal why machine learning works the way it does...
There are TWO different philosophies for what "probability" means. The famous Monty Hall problem shows this difference beautifully.
You're on a game show. There are 3 doors. Behind one is a car 🚗, behind the other two are goats 🐐.
Surprisingly: You should ALWAYS switch! But why? Let's see how Frequentist and Bayesian thinkers arrive at this answer...
A Frequentist says: "I don't speculate. Let me play this game many times and see what happens."
"After 1000 games, switching wins ~67% of the time. Therefore, switch!"
The Frequentist discovers the answer through experimentation and observation of long-run frequencies.
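Here's a sketch of that frequentist experiment, playing the game many times in code:

```python
import random

def play(switch: bool) -> bool:
    """Play one round of Monty Hall; return True if you win the car."""
    doors = [0, 1, 2]
    car = random.choice(doors)
    pick = random.choice(doors)
    # Monty opens a door that is neither your pick nor the car.
    monty = random.choice([d for d in doors if d != pick and d != car])
    if switch:
        pick = next(d for d in doors if d != pick and d != monty)
    return pick == car

random.seed(1)
N = 100_000
print("Stay:  ", sum(play(False) for _ in range(N)) / N)  # ≈ 0.333
print("Switch:", sum(play(True) for _ in range(N)) / N)   # ≈ 0.667
```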
A Bayesian says: "I start with initial beliefs, then UPDATE them as I get new information."
You pick Door 1. Each door is equally likely to have the car: P = 1/3 each.
NEW INFORMATION: Host revealed a goat behind Door 3
Key insight: The host knows where the car is and will NEVER open the car door!
When you picked Door 1, there was a 2/3 chance the car was behind Door 2 OR Door 3. The host just eliminated Door 3, so all that probability flows to Door 2!
"I'm 67% confident the car is behind Door 2. Switch!"
The Bayesian deduces the answer through logical reasoning and belief updating from a single game.
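If you want the belief update as an explicit calculation, here it is in this chapter's notation. You picked Door 1, and Monty opened Door 3:

P(opens 3 | car at 1) = 1/2 (Monty could have opened Door 2 or Door 3)
P(opens 3 | car at 2) = 1 (Monty must avoid the car, so Door 3 is his only option)
P(opens 3 | car at 3) = 0 (Monty never reveals the car)

P(car at 2 | opens 3) = (1/3 × 1) / (1/3 × 1/2 + 1/3 × 1 + 1/3 × 0) = (1/3) / (1/2) = 2/3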
Still not convinced switching helps? Let's scale up the problem to make the logic crystal clear.
Probability you picked correctly: 1/100 = 1%
Probability the car is behind one of the other 99 doors: 99/100 = 99%
The host, who knows where the car is, now opens 98 of the other 99 doors, revealing a goat behind every single one. Only your Door #1 and Door #47 remain closed. He deliberately left Door #47 closed.
- Door #1 (1%): your initial random guess
- Door #47 (99%): it inherits all the other 99 doors' probability!
OBVIOUSLY you should switch!
With 100 doors, it's intuitive that your first pick was almost certainly wrong (only a 1% chance of being right), so the one door the host deliberately avoided opening almost certainly hides the car (99%).
The same logic applies to 3 doors, but with 100 doors, our intuition finally catches up!
| Aspect | Frequentist | Bayesian |
|---|---|---|
| Probability means... | Long-run frequency in repeated trials | Degree of belief given available information |
| How they solve Monty Hall... | Simulate 1000 games, observe switching wins 67% | Update beliefs from 33%/33%/33% to 33%/67%/0% |
| Answer comes from... | Empirical observation | Logical deduction |
| Best for... | Repeatable experiments, quality control, A/B testing | One-time decisions, incorporating prior knowledge, sequential learning |
| In Machine Learning... | Training models on large datasets | Online learning, spam filters, recommendation systems |
So which philosophy is right? Both! They complement each other:
Modern AI systems use both approaches. Neural networks are trained with frequentist methods but make predictions that are interpreted as Bayesian probabilities!
Cognitive psychology research has discovered something fascinating: we naturally tend to think like frequentists. When making decisions, we often ignore prior probabilities (base rates) and focus only on the immediate evidence in front of us.
When we make decisions, we often commit what psychologists call the "base rate fallacy" — ignoring general probability information (priors) in favor of specific case information.
The Facts:
- 85% of the cabs in the city are Green; 15% are Blue.
- A cab was involved in a hit-and-run at night.
- A witness identified the cab as Blue.
- The witness correctly identifies cab colors 80% of the time.
Question: What's the probability it was actually a Blue cab?
It's natural to focus on the witness reliability (80%) and overlook the base rate (only 15% of cabs are Blue). This is frequentist thinking — trusting only the observed data.
Using Bayes' Theorem and considering BOTH the witness reliability AND the base rate:
Scenario 1: Cab is Blue (15% base rate) → Witness says Blue (80% reliable) = 0.15 × 0.80 = 0.12
Scenario 2: Cab is Green (85% base rate) → Witness says Blue (20% wrong) = 0.85 × 0.20 = 0.17
P(Blue | witness says Blue) = 0.12 / (0.12 + 0.17) ≈ 41%
The base rate matters! Most cabs are Green, so even with witness testimony, there's still a good chance it was a misidentified Green cab.
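If you want to check the arithmetic, here's a sketch with a small reusable Bayes'-rule helper:

```python
def posterior(prior, p_evidence_if_true, p_evidence_if_false):
    """P(hypothesis | evidence) via Bayes' theorem."""
    numerator = prior * p_evidence_if_true
    return numerator / (numerator + (1 - prior) * p_evidence_if_false)

# Hypothesis: the cab was Blue. Evidence: the witness says "Blue".
print(posterior(prior=0.15, p_evidence_if_true=0.80, p_evidence_if_false=0.20))
# ≈ 0.414 -- only about a 41% chance the cab was actually Blue
```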
Here's a medical scenario that shows the same pattern:
The Setup:
- A disease affects 0.1% of the population (1 in 1,000 people).
- A test for it is 99% accurate.
- You test positive.
"The test is 99% accurate, so I probably have it."
Error: Ignoring the 0.1% base rate!
"Disease is rare (0.1%). Even with a positive test, I only have ~9% chance."
Considers BOTH the test accuracy AND the base rate.
Our brains are wired to focus on concrete, immediate evidence (the test result, the witness) rather than abstract statistical information (base rates). This is frequentist thinking — let the data "speak for itself" without considering prior probabilities.
This frequentist vs Bayesian intuition shows up in machine learning estimation methods.
Translation: "What parameter values make this observed data most likely?"
You flip a coin 10 times: 7 heads, 3 tails.
MLE estimate: P(heads) = 7/10 = 70%
Just counts the data. Doesn't use prior knowledge that coins are usually fair.
Translation: "What parameter values are most likely, given BOTH the data AND my prior beliefs?"
You flip a coin 10 times: 7 heads, 3 tails.
MAP estimate (with prior that coins are usually fair): P(heads) ≈ 55-60%
Balances the observed 70% with prior belief that coins are typically 50/50. Result is pulled toward fairness.
MLE is a special case of MAP where the prior P(θ) is uniform (all values equally likely). When you assume no prior knowledge, MAP reduces to MLE.
MAP with uniform prior = MLE
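Here's a minimal sketch of both estimators for the coin example. The Beta(10, 10) prior encoding "coins are usually fair" is our own assumption (a stronger or weaker prior would pull the estimate more or less), and the uniform Beta(1, 1) case demonstrates the reduction to MLE:

```python
heads, tails = 7, 3  # observed: 7 heads in 10 flips

# MLE: trust only the data.
mle = heads / (heads + tails)  # 0.70

# MAP with a Beta(a, b) prior; the posterior mode is
# (heads + a - 1) / (n + a + b - 2).
def map_estimate(a, b):
    return (heads + a - 1) / (heads + tails + a + b - 2)

print(f"MLE:                     {mle:.3f}")                   # 0.700
print(f"MAP, Beta(10,10) prior:  {map_estimate(10, 10):.3f}")  # 0.571, pulled toward 0.5
print(f"MAP, uniform Beta(1,1):  {map_estimate(1, 1):.3f}")    # 0.700 = MLE!
```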
Your intuition is frequentist — you naturally focus on immediate evidence and ignore base rates. But Bayesian thinking (considering priors) often gives better answers, especially with limited data. This is why the medical test problem feels so counterintuitive: we have to FIGHT our frequentist intuition to properly incorporate the base rate!
You've tested positive for a rare disease. The test is 99% accurate. Should you panic?
What's the probability you actually have the disease?
Prior: your initial belief before seeing any evidence. In this case: P(disease) = 0.1% (the base rate in the population)
Posterior: your updated belief after seeing evidence. This is what we're trying to find: P(disease | positive test)
Likelihood: how likely the evidence is if your hypothesis is true. In this case: P(positive test | disease) = 99%
🔑 Key Insight: Bayes' Theorem tells us how to update our Prior using the Likelihood to get the Posterior
Prior + New Evidence (Likelihood) = Posterior
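In symbols, using this chapter's notation:

P(disease | positive) = P(positive | disease) × P(disease) / P(positive)

where the denominator P(positive) adds up every way to test positive:

P(positive) = P(positive | disease) × P(disease) + P(positive | no disease) × P(no disease)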
[Interactive calculator: set the disease prevalence (how common the disease is in the population) and the test accuracy (how well the test detects the disease). The display shows the prior, your chance of disease before the test, then applies the new evidence (Test Result = POSITIVE) to give the posterior, your chance of disease after the positive test.]
With default values (0.1% prevalence, 99% accuracy), even after testing positive, you only have about 9% chance of actually having the disease!
Why? Because the disease is so rare. Out of 100,000 people, only about 100 have the disease (and ~99 of them test positive), while the 99,900 healthy people generate ~999 false positives. Most positive results come from healthy people!
This is why doctors often order multiple tests—each positive result updates the probability higher!
Experiment 1: Increase disease prevalence to 10%. Notice how the posterior jumps to 91%!
Experiment 2: Lower test accuracy to 90%. See how false positives increase.
Experiment 3: Set prevalence to 50% (coin flip). What happens to the posterior?
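Here's a sketch you can use to rerun these experiments, assuming the test's false-positive rate is also 1% (i.e., 99% accuracy in both directions; slightly different assumptions shift the numbers by a point or so):

```python
def p_disease_given_positive(prevalence, accuracy):
    """Posterior probability of disease after one positive test.
    Assumes the same accuracy for detecting the disease (sensitivity)
    and for clearing healthy people (specificity)."""
    true_pos = prevalence * accuracy
    false_pos = (1 - prevalence) * (1 - accuracy)
    return true_pos / (true_pos + false_pos)

print(p_disease_given_positive(0.001, 0.99))  # ≈ 0.09  (the defaults)
print(p_disease_given_positive(0.10, 0.99))   # ≈ 0.92  (experiment 1)
print(p_disease_given_positive(0.001, 0.90))  # ≈ 0.009 (experiment 2)
print(p_disease_given_positive(0.50, 0.99))   # = 0.99  (experiment 3)
```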
Medical testing is life-and-death. But Bayesian thinking and conditional probability have solved mysteries throughout history. Let's look at one of the most famous examples...
In 1787-1788, three founding fathers—Alexander Hamilton, James Madison, and John Jay—wrote 85 essays to convince Americans to ratify the new U.S. Constitution. They published all essays anonymously under the pen name "Publius".
After publication, the authorship of most essays was clear:
The disputed papers: Numbers 49-58, 62, and 63
For 175 years, historians debated who wrote these 12 essays. Hamilton died in a duel with Aaron Burr in 1804 without clarifying. Madison died in 1836, also leaving the question unresolved.
Statisticians Frederick Mosteller (Harvard) and David Wallace (University of Chicago) used conditional probability and Bayes' Theorem to solve the mystery.
They analyzed essays where authorship was certain and found discriminating words—small filler words that authors use unconsciously:
Hamilton loved "upon" (used it 10× more than Madison). Madison loved "by" (used it 2× more than Hamilton). These aren't conscious choices—they're unconscious writing habits!
For each disputed essay, they calculated: how likely are this essay's word frequencies if Hamilton wrote it, versus how likely if Madison wrote it? The ratio of those two probabilities gives the odds favoring one author.
They analyzed about 30 discriminating words. Each word provides independent evidence that updates the probability.
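Here's a toy sketch of the idea with made-up word rates (illustrative only, not Mosteller and Wallace's actual numbers), modeling word counts in a 1,000-word stretch of text with a Poisson distribution:

```python
import math

# Hypothetical usage rates (occurrences per 1,000 words) -- invented for illustration.
hamilton = {"upon": 3.0, "by": 7.0}   # Hamilton favors "upon"
madison = {"upon": 0.3, "by": 14.0}   # Madison favors "by"

# Hypothetical counts observed in 1,000 words of a disputed essay.
observed = {"upon": 0, "by": 20}

def log_poisson(count, rate):
    """Log-probability of seeing `count` occurrences under Poisson(rate)."""
    return count * math.log(rate) - rate - math.lgamma(count + 1)

log_odds = sum(
    log_poisson(c, madison[w]) - log_poisson(c, hamilton[w])
    for w, c in observed.items()
)
print(f"Odds favoring Madison: {math.exp(log_odds):,.0f} to 1")
# Each additional discriminating word multiplies the odds further.
```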
For some disputed essays, the odds favoring Madison reached 160,000,000,000 to 1. That's 160 BILLION to 1! The other disputed essays showed lower odds, but still overwhelmingly favoring Madison.
All 12 disputed Federalist Papers (49-58, 62-63) were written by James Madison.
After 175 years of historical debate, conditional probability settled the question with overwhelming confidence.
Back in Chapters 1-2, we learned: Price = $100k + $50k × bedrooms + $100k × bathrooms
The model predicts: $450,000
Will this house sell for exactly $450,000.00?
Even with the same bedrooms and bathrooms, different houses sell for different prices because of factors we didn't measure: location, lot size, condition, renovations, school district, market timing, negotiation.
"Houses with 3 bedrooms and 2 bathrooms sell for $450k on average"
But individual houses vary around this average.
Let's say we look at 100 houses with 3 bedrooms and 2 bathrooms. Here's what we might observe:
- Mean = $450k: the model's prediction, the center of the distribution. Most houses cluster around this value.
- Standard deviation ≈ $30k: measures the "typical" spread. About 68% of houses sell within ±$30k of the mean ($420k-$480k).
- 95% range = $390k-$510k: about 95% of houses with these features sell within this range (mean ± 2 standard deviations).
The model predicts the average (mean) price. Individual houses vary around this average due to unmeasured factors. This variation forms a distribution (typically bell-shaped/normal).
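Here's that picture as a sketch in code, assuming prices follow a normal distribution with the mean and spread above:

```python
import random
import statistics

random.seed(7)

# Simulate sale prices for 100 houses with 3 bed / 2 bath,
# assuming Normal(mean=$450k, std=$30k).
prices = [random.gauss(450_000, 30_000) for _ in range(100)]

mean = statistics.mean(prices)
std = statistics.stdev(prices)
print(f"mean ≈ ${mean:,.0f}, std ≈ ${std:,.0f}")
print(f"about 95% of sales fall in roughly "
      f"${mean - 2*std:,.0f} to ${mean + 2*std:,.0f}")
```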
A point prediction is what simple models give you: a single number, like "I predict $450k."
Problem: it doesn't tell you how confident the prediction is!
Modern ML systems provide both prediction AND uncertainty.
✓ Better: "I predict $450k, typically ±$30k"
Advanced models output entire probability distributions.
✓ Best: "Here's the full range of likely prices"
Zillow doesn't just predict "$450k"—they show a range: "$420k - $480k" with a confidence level. This is prediction uncertainty in action!
"High of 75°F, but could range from 72-78°F" — they're giving you the mean and the uncertainty range.
When ChatGPT or Claude say "I'm fairly confident..." or "I'm not entirely sure...", they're communicating uncertainty based on token probability distributions.
If you want to understand the formal terminology:
- Mean (μ): the average value. In our example: $450k.
- Standard deviation (σ): how spread out values are from the mean. In our example: ≈$30k. Smaller σ = tighter predictions; larger σ = more uncertainty.
- Variance (σ²): the square of the standard deviation. Less intuitive but mathematically useful. In our example: ($30k)² = $900M (note the units are dollars squared).
- Normal distribution: the bell-shaped curve, also called the Gaussian distribution. Many natural phenomena (including ML prediction errors) follow this pattern.
In linear regression, we assume the prediction errors (the difference between predicted and actual values) follow a normal distribution with mean 0 and some standard deviation σ.
This lets us say: "95% confident the actual price will be within $390k-$510k"
Coin flips, Monty Hall, conditional probability — understanding uncertainty in discrete events
When outcomes are continuous (like house prices), we use probability distributions to model uncertainty
Modern ML doesn't just predict values — it predicts distributions, giving you both the answer AND the confidence
Whether it's flipping coins, predicting house prices, or getting answers from ChatGPT — probability and uncertainty are fundamental. Good ML systems don't just give you answers; they tell you how confident they are in those answers.
We've seen how probability shapes predictions in machine learning—from coin flips to house prices. Now let's explore the most sophisticated probability machines ever built: Large Language Models. Every word you read from ChatGPT, Claude, or Gemini is the result of probability distributions over tens of thousands of possible words.
Every word you see from ChatGPT, Claude, or Gemini is the result of sampling from a probability distribution. Let's understand what that really means.
Step 1: You give the model a prompt, say, "The capital of France is".
Step 2: The model breaks the text into tokens (words and word pieces).
Step 3: For the next position, it computes a probability for every token in its vocabulary of ~50,000, for example: "Paris" 95%, "located" 2%, "actually" 1%, and tiny slivers for everything else.
Step 4: The model samples a word based on these probabilities. Usually "Paris" (95% likely), but occasionally something else!
LLMs don't "know" facts. They predict probable next tokens.
When Claude says "Paris," it's not retrieving a stored fact—it's predicting the most probable token given billions of training examples where "capital of France is" preceded "Paris."
You can control how "creative" or "conservative" an LLM is by adjusting temperature—a parameter that reshapes the probability distribution.
Temperature ≈ 0 (low): always picks the MOST likely token
✅ Use for: Factual Q&A, code generation, translations
Temperature = 1 (default): samples proportionally to the original probabilities
⚖️ Use for: General chat, balanced responses
Temperature > 1 (high): flattens the distribution—more surprises!
🎨 Use for: Creative writing, brainstorming, poetry
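Under the hood, temperature divides the model's raw scores (logits) before they're turned into probabilities. A minimal sketch with a made-up 3-token vocabulary:

```python
import numpy as np

def apply_temperature(logits, temperature):
    """Convert logits to probabilities, reshaped by temperature."""
    scaled = np.array(logits) / temperature
    exp = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    return exp / exp.sum()

logits = [5.0, 2.0, 1.0]  # hypothetical scores for 3 tokens
for t in [0.1, 1.0, 2.0]:
    print(t, apply_temperature(logits, t).round(3))
# Low temperature sharpens the distribution (near-greedy);
# high temperature flattens it (more diversity).
```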
Prompt: "Write a story opening: 'Once upon a time'"
Besides temperature, LLMs use sampling strategies to decide which tokens are even considered.
Top-K sampling. Strategy: only consider the K most likely tokens.
Example with K = 3: only "Paris" (95%), "located" (2%), and "actually" (1%) stay in the running. All 49,997 other tokens are ignored, even though they sum to 2%.
🎯 Prevents completely nonsensical outputs
Top-P (nucleus) sampling. Strategy: consider the smallest set of tokens whose probabilities sum to at least P.
Example with P = 0.95: include tokens until the cumulative probability reaches 95%. This adapts to context—sometimes 3 tokens are considered, sometimes 50!
🧠 More flexible than Top-K; used by most modern LLMs
These techniques balance quality (staying probable) with diversity (avoiding repetition). Without them, an LLM either repeats the same safest phrases over and over, or occasionally samples an absurd low-probability token that derails the text. Both filters are sketched in code below.
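Here's a minimal sketch of both filters over a toy 5-token vocabulary (the probabilities are invented for illustration):

```python
import numpy as np

def top_k_filter(probs, k):
    """Keep only the k most likely tokens, then renormalize."""
    idx = np.argsort(probs)[::-1][:k]
    filtered = np.zeros_like(probs)
    filtered[idx] = probs[idx]
    return filtered / filtered.sum()

def top_p_filter(probs, p):
    """Keep the smallest set of tokens with cumulative probability >= p."""
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # number of tokens to keep
    filtered = np.zeros_like(probs)
    filtered[order[:cutoff]] = probs[order[:cutoff]]
    return filtered / filtered.sum()

probs = np.array([0.95, 0.02, 0.01, 0.01, 0.01])  # toy vocabulary
print(top_k_filter(probs, 3))    # only the top 3 tokens survive
print(top_p_filter(probs, 0.95)) # just the top token already reaches 95%
```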
So far, we've seen how LLMs sample individual words from probability distributions. But what about complex tasks—like solving math problems or writing code? This brings us to one of the most talked-about topics in AI: "reasoning."
You've probably heard about "reasoning models" like OpenAI's o1 or "chain-of-thought prompting." Let's demystify what "reasoning" actually means in LLMs—spoiler: it's still all about probability.
Despite the terminology, LLMs don't have internal mental states, beliefs, or logic. They:
What they DO: Generate sequences of tokens where each token is predicted based on probability distributions learned from training data.
"Reasoning" in LLMs means generating intermediate steps that increase the probability of reaching correct final answers.
Prompt: "What is 347 × 29?"
"10,063"
Accuracy: ~60%
The model must predict the entire answer in one token, sampling from a distribution where the correct answer might only have 60% probability.
Prompt: "What is 347 × 29? Let's think step by step."
"First, 347 × 20 = 6,940
Then, 347 × 9 = 3,123
Adding: 6,940 + 3,123 = 10,063"
Accuracy: ~90%
Each intermediate step has high probability (easier to predict). The path through probability space leads to the correct answer more reliably.
It's not magic—it's probability geometry.
By generating intermediate tokens, the model breaks one hard prediction into several easy ones: each step conditions on the steps already written, so every next token is sampled from a sharper, higher-confidence distribution.
Think of it like the Monty Hall problem: showing intermediate steps updates the probabilities, just like the host revealing a goat!
Models like OpenAI's o1 take chain-of-thought to the extreme: they generate thousands of internal reasoning tokens before outputting an answer.
"Prove that √2 is irrational"
(You don't see these—they're internal to the model)
"Hmm, this is a proof by contradiction..."
"Assume √2 = p/q in lowest terms..."
"Then 2q² = p², so p² is even..."
"Wait, that means p is even too..."
"So p = 2k for some k..."
"Substituting: 2q² = (2k)² = 4k²..."
"Therefore q² = 2k², so q is also even..."
"But that contradicts p/q being in lowest terms!"
Model generated ~500 reasoning tokens internally
"By contradiction: Assume √2 = p/q. Then 2q² = p², implying both p and q are even, contradicting lowest terms. Therefore √2 is irrational."
More reasoning tokens = exploring more probability paths = higher chance of correct answer
It's like the Frequentist Monty Hall simulator: running more trials gives you better estimates. Here, generating more intermediate tokens lets the model "search" through probability space more thoroughly.
Even with "reasoning," LLMs have fundamental limits because they're statistical pattern matchers, not logical engines.
An LLM might generate the text "Therefore A implies B" even if A doesn't actually imply B—it's predicting plausible text, not performing symbolic logic.
Even o1 can make errors on math problems. It's just much less likely to, because it explores more probability paths. But probability ≠ certainty.
LLMs can only generate reasoning patterns seen in training data. They can't invent genuinely new proof techniques or logical frameworks.
With models like o1, you can't audit the reasoning—it's hidden. You're trusting a probability distribution, not verifiable logic.
Think of LLM "reasoning" through a Bayesian lens:
Chain-of-thought is essentially Bayesian updating in token space—each step refines the distribution until a high-confidence answer emerges.
ML predictions aren't guarantees—they're probabilistic statements about what's likely, given the data.
Frequentist: Long-run frequency (repeat experiments)
Bayesian: Degree of belief (update with evidence)
Both are valid, both are used in ML.
Sigmoid: Squashes outputs to [0,1] for binary classification
Softmax: Creates probability distribution over multiple classes
Loss = -log(p) penalizes confident wrong predictions heavily
This is why we use logarithms in loss functions!
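If those terms are new, here's a minimal sketch of all three:

```python
import math

def sigmoid(x):
    """Squash any real number into (0, 1)."""
    return 1 / (1 + math.exp(-x))

def softmax(scores):
    """Turn a list of scores into a probability distribution."""
    exps = [math.exp(s - max(scores)) for s in scores]  # stable
    total = sum(exps)
    return [e / total for e in exps]

print(sigmoid(2.0))              # ≈ 0.88 -- "88% confident it's spam"
print(softmax([2.0, 1.0, 0.1]))  # three class probabilities summing to 1

# Cross-entropy loss -log(p) punishes confident mistakes:
for p in [0.9, 0.5, 0.1]:
    print(f"predicted p={p} for the true class -> loss {-math.log(p):.2f}")
# p=0.9 -> 0.11 (small penalty), p=0.1 -> 2.30 (large penalty!)
```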
From training on billions of examples to generating text token-by-token, probability is the foundation of how modern AI works.
- Bayes' Theorem: update beliefs with new evidence
- Sigmoid: converts any number to a probability in [0,1]
- Softmax: probability distribution over K classes
- Cross-entropy (log loss): measures surprise / prediction quality
Test what you've learned in this chapter!
Dropout: Randomly "turning off" neurons during training
→ Uses Bernoulli probability (coin flip for each neuron)
Softmax: Creates probability distribution over words
→ "Which words should I focus on?" = weighted by probabilities
Mixture of Experts: Routing tokens to different expert models
→ Router outputs probabilities: "Send 70% of this token to Expert 1"
Document Ranking: Which documents are most relevant?
→ Cosine similarity scores interpreted as probabilities
With this probability foundation, concepts like softmax, dropout, and uncertainty estimation will make intuitive sense when you encounter them.