From numbers to vectors to matrices—the foundation of neural networks
Matrices are tables that organize multiple customers or items into rows, with features in columns. Like an Excel spreadsheet—but for neural networks.
They enable batch processing: handling 1,000 customers as easily as 1. This is what makes the GPU parallelization behind modern AI systems possible.
Matrix multiplication transforms inputs: Customer data × Weights = Predictions. Same formula scales from 3 customers to 3 million.
The sections below explore these concepts with detailed examples, visuals, and business applications. Some sections include optional deep dives for those interested.
Let's build up from the simplest to more complex structures. Every machine learning model works with these three building blocks:
Scalar: just one value. Example: a customer's tenure in months.
Vector: multiple values in a row. Example: one customer with 2 features (months, usage). Shape: 1 row, 2 numbers → this is actually a 1×2 matrix!
Matrix: a table of numbers. Example: 3 customers, each with 2 features. Shape: 3 rows, 2 columns → a 3×2 matrix.
A vector is just a special case of a matrix: the vector [10, 35] is actually a 1×2 matrix (1 row, 2 columns). When we learned about vectors in Chapter 5, we were already learning about matrices!
When we describe a matrix, we specify its shape as: rows × columns
4×2 matrix (4 rows, 2 columns)
In Chapter 5, we learned how to compare vectors using dot products. Now let's explore how to scale this up efficiently.
Imagine having 10,000 customers and needing to make predictions for all of them:
Each customer processed sequentially. With 1 million customers, that's 1 million separate operations.
With matrices, we can process ALL customers in a single operation:
All customers processed simultaneously. Same code works for 10 customers or 10 million.
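Here's a minimal NumPy sketch of the difference (the array contents are illustrative, not from the examples below):

```python
import numpy as np

# 10,000 customers, each with 2 features (e.g., tenure, usage)
X = np.random.rand(10_000, 2)
w = np.array([1.0, 2.0])  # one weight per feature

# One at a time: 10,000 separate dot products
scores_loop = [x @ w for x in X]

# All at once: a single matrix-vector multiplication
scores_batch = X @ w

# Both give the same predictions
assert np.allclose(scores_loop, scores_batch)
```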
Matrix operations can be parallelized—multiple calculations happening at the exact same time. This is why GPUs (Graphics Processing Units) are essential for AI:
CPU: handles operations sequentially; great for complex logic; typically 8-16 cores.
GPU: handles operations in parallel; optimized for matrix math; thousands of cores working simultaneously.
Example: Training GPT-3 on CPUs would take decades. On GPUs processing matrices in parallel? Weeks.
Matrices don't just organize data—they enable batch processing and parallel computation. Without matrices, modern AI simply wouldn't be practical. Training a language model would take years instead of days.
Now that we understand what matrices are and how to organize data in them, let's learn how to multiply two matrices together. This operation is fundamental to neural networks—it's how data flows through each layer and gets transformed. We'll start with a simple example and build our understanding step by step.
Remember the dot product from Chapter 5? Multiply corresponding elements and add them up. Now we'll use it repeatedly—that's matrix multiplication.
Let's compute a simple score for one customer using the dot product:
Result: One customer → One dot product → One score (12)
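In code, the dot product is one line. The exact feature values behind the score of 12 aren't shown above, so the numbers below are an illustrative pairing that produces it:

```python
import numpy as np

customer = np.array([2, 5])  # illustrative features (months, usage)
weights = np.array([1, 2])   # one weight per feature

# Dot product: multiply corresponding elements, then add them up
score = customer @ weights   # 2*1 + 5*2 = 12
print(score)                 # 12
```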
Now let's say we have 3 customers, and we want predictions for all of them. Let's use simple data:
Our weights: w = [1, 2] → w₁ = 1, w₂ = 2 (we write w for weights, since b is reserved for the bias we'll meet shortly)
Instead of treating each customer separately, let's stack their data into one matrix:
This is a 3×2 matrix (3 rows, 2 columns). Each row is one customer.
Here's how we calculate ALL 3 predictions at once. Each row does one dot product with the weight column:
Row 1: [4,2]·[1,2] = 4×1 + 2×2 = 8. Same for rows 2 and 3: [1,2]·[1,2] = 5 and [0,5]·[1,2] = 10. Result: [8, 5, 10].
This is matrix multiplication! We did 3 dot products (one for each row) and got 3 predictions all at once. Every row of the first matrix did a dot product with the column of the second matrix.
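In NumPy, the whole calculation is a single operation:

```python
import numpy as np

X = np.array([[4, 2],   # customer 1
              [1, 2],   # customer 2
              [0, 5]])  # customer 3
w = np.array([1, 2])    # weights

predictions = X @ w     # one dot product per row
print(predictions)      # [ 8  5 10]
```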
Now that we've seen the complete example, here's the general pattern:
The Rule: Each row does a dot product with the column. That's all matrix multiplication is!
Let's verify our understanding. Can we predict what the first element of the result will be?
First element: Take row 1 [2, 3] and do dot product with column [4, 2]
Second element: Take row 2 [5, 1] and do dot product with column [4, 2]
Final answer: [14, 22]
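We can check the answer in one line of NumPy:

```python
import numpy as np

A = np.array([[2, 3],
              [5, 1]])
col = np.array([4, 2])  # the column vector

print(A @ col)  # [14 22] — one dot product per row
```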
This simple operation—doing multiple dot products at once—is THE fundamental operation in neural networks. Every prediction, every layer, every training step uses matrix multiplication billions of times.
Matrix multiplication = one dot product per row. That's it. This simple pattern powers all of modern AI.
Phase 1: The Prototype (3 Customers)
A simple model tracking just 2 features for 3 test customers:
Three months later... The boss walks in: "This model works great! Now let's roll it out to our entire customer base — 10,000 customers."
But there's a catch. The data science team says: "For accurate predictions at scale, we need richer customer data. We're now tracking 5 features per customer:"
The data pipeline is updated to pull these 5 features for all 10,000 customers.
Let's see what happened. Before, the shapes fit: 3 customers × 2 features, multiplied by 2 weights, gave 3 predictions.
Now here's what breaks when trying to scale: 10,000 customers × 5 features, multiplied by the old 2 weights—the shapes no longer fit.
Why the mismatch? The production data now has 5 features per customer (tenure, usage, tickets, logins, features adopted), but the old model only has 2 weights (one for tenure, one for usage).
We can't compute a dot product between a 5-element customer vector and a 2-element weight vector — there's no way to pair up all the features! The 3 new features (tickets, logins, features) have no corresponding weights. The shapes simply don't fit together.
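NumPy refuses the mismatched operation outright. A sketch with illustrative feature values:

```python
import numpy as np

new_customer = np.array([18, 42, 3, 25, 4])  # 5 features (illustrative values)
old_weights = np.array([1, 2])               # the old model's 2 weights

try:
    score = new_customer @ old_weights
except ValueError as err:
    # A 5-element vector can't pair up with 2 weights
    print("Shapes don't fit:", err)
```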
Matrix multiplication isn't just about scaling up. The shapes must fit together like puzzle pieces. Scaling from 3 to 10,000 customers requires understanding why these shape constraints exist.
Matrix multiplication works by doing multiple dot products. Since dot products require vectors of the same length, the shapes of our matrices must be compatible.
The middle numbers must match. This ensures each row from the first matrix can pair with each column from the second matrix.
✓ (3×2) × (2×1): middle numbers match (both 2)
✗ (3×3) × (2×1): middle numbers don't match (3 ≠ 2)
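A tiny helper (hypothetical, written here just for illustration) makes the rule concrete:

```python
def can_multiply(shape_a, shape_b):
    """Matrices (rows_a x cols_a) and (rows_b x cols_b) can multiply
    only when the middle numbers match: cols_a == rows_b."""
    return shape_a[1] == shape_b[0]

print(can_multiply((3, 2), (2, 1)))  # True  — both middle numbers are 2
print(can_multiply((3, 3), (2, 1)))  # False — 3 != 2
```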
We've learned matrix multiplication: Y = XW. Let's test it on a brand-new customer who just signed up—every feature is still at zero, so the prediction is [0, 0]·[1, 2] = 0:
A health score of zero suggests this customer will churn immediately. But they just made a conscious decision to sign up and pay for the product! They should start at some reasonable baseline—not at total failure.
Our formula Y = XW is locked to zero when all inputs are zero. It can't capture the reality that new customers start at some baseline level before their behavior matters. We need a way to shift the entire prediction up or down.
Remember from Chapter 1: bias provides this baseline
Now let's see how bias works with matrices...
In Chapter 1, we learned that bias is the baseline value that exists even when all inputs are zero—like the land value in house prices. The formula was: y = w₁×x₁ + w₂×x₂ + bias.
With matrices processing multiple customers at once, bias works the same way—it's added to every prediction:
Multiply inputs by weights, then add the baseline (bias): Y = XW + b.
For a brand-new customer whose features are all zero, the prediction is simply the bias. Starting at 5 represents a neutral baseline—not doomed to fail, but not proven successful yet.
Now that we understand why bias matters, here's how it fits into matrix calculations: when processing multiple data points at once, the same bias is added to every row of the result.
Good news: We now understand the math behind neural networks! In practice, deep learning frameworks provide optimized, GPU-accelerated implementations of these matrix operations. Common frameworks include TensorFlow and PyTorch. But what do these names even mean?
Let's break down the name:
Tensor: the general mathematical word for "arrays of numbers"—a scalar is a 0-dimensional tensor, a vector is 1-dimensional, a matrix is 2-dimensional, and tensors extend to any number of dimensions
Flow: The flow of data through computations
The data "flows" through layers: input → matrix multiply → add bias → output
Put it together: TensorFlow is a framework that makes tensors flow through neural network computations
(PyTorch is similar—it also works with tensors, but uses a slightly different approach)
PyTorch (one of those neural-network frameworks) computes its layers using Wᵀ (W-transpose). What does transpose mean? It's simple: flip the matrix along its diagonal—rows become columns and columns become rows.
Row 1 of W → Column 1 of Wᵀ: [1, 2, 3] becomes a column.
Row 2 of W → Column 2 of Wᵀ: [4, 5, 6] becomes a column.
Notice: a (2×3) matrix becomes (3×2)—the dimensions flip!
PyTorch (the tensor-based tool we met earlier) stores weights as (out_features, in_features) instead of (in_features, out_features). To make the matrix multiplication work, it transposes the weights during computation: y = xWᵀ + b.
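A small PyTorch sketch showing the stored shape and the transpose (feature counts chosen to match the 2×3 example above):

```python
import torch

layer = torch.nn.Linear(in_features=3, out_features=2)
print(layer.weight.shape)  # torch.Size([2, 3]) — stored as (out_features, in_features)

x = torch.randn(5, 3)                     # 5 samples, 3 features each
manual = x @ layer.weight.T + layer.bias  # the transpose makes the shapes fit
print(torch.allclose(layer(x), manual))   # True — same result as the layer itself
```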
Let's work through a complete example showing how bias is added after matrix multiplication. This is exactly how the math works inside tools like TensorFlow and PyTorch.
Customer data X = [[4, 2], [1, 2], [0, 5]], weights w = [1, 2], bias b = 3.
First, multiply the data by weights—exactly what we learned earlier: Xw = [8, 5, 10].
Now add the bias value to EACH prediction: Y = Xw + b = [8+3, 5+3, 10+3] = [11, 8, 13].
Y (capital) because we're predicting for multiple customers at once. The bias shifts each prediction by the same amount. It's added after matrix multiplication!
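The same worked example in NumPy:

```python
import numpy as np

X = np.array([[4, 2],
              [1, 2],
              [0, 5]])
w = np.array([1, 2])
bias = 3

Y = X @ w + bias   # the bias is broadcast to every prediction
print(Y)           # [11  8 13]
```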
In our example, we predicted one thing per customer: their health score. But what if we wanted to predict multiple things at once?
Imagine we want to predict TWO scores for each customer: a health score and a satisfaction score.
Now we're predicting 2 values per customer, not just 1!
Key insight: When there are 2 output features (health score + satisfaction score), we need 2 bias values—one baseline for each output type. Health and satisfaction might have different baselines!
Let's calculate predictions for Customer 1 with 2 outputs: each output does its own dot product with its own weight column, then adds its own bias (a code sketch follows below).
Why different bias values? Health scores and satisfaction scores measure different things with different scales. Health might naturally center around 10, while satisfaction centers around 50. Each output needs its own baseline!
Number of bias values = Number of output features
Each output dimension gets its own baseline adjustment, independent of the others.
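Here's a sketch of two outputs at once. The health weight column matches our earlier weights [1, 2]; the satisfaction column is hypothetical, and the biases [10, 50] follow the baselines mentioned above:

```python
import numpy as np

X = np.array([[4, 2],
              [1, 2],
              [0, 5]])

# One weight COLUMN per output: health (as before) and satisfaction (made up)
W = np.array([[1, 3],
              [2, 4]])
b = np.array([10, 50])  # one baseline per output (health ~10, satisfaction ~50)

Y = X @ W + b           # shape (3, 2): 3 customers x 2 scores
print(Y)
# [[18 70]   <- customer 1: health 8+10, satisfaction 20+50
#  [15 61]
#  [20 70]]
```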
Just like weights, bias values are learned during training. The model adjusts both weights and biases to minimize prediction errors.
Weights: control the slope or direction of the relationship.
Bias: controls the offset or baseline of the output.
We've been using neurons this whole time! A neuron is simply one output prediction.
In our example: the health score is one neuron, and the satisfaction score is a second neuron—one neuron per output.
Each neuron does the math: takes all inputs, multiplies by its weights, adds them up, then adds its bias. That's it! When people say "a neural network has millions of neurons," they mean millions of these simple calculations.
When we say a layer has a certain number of parameters, we mean weights + biases. In our 2-input, 2-output example: 2 × 2 = 4 weights plus 2 biases = 6 parameters.
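We can count them directly in PyTorch (a 2-input, 2-output layer as in our example):

```python
import torch

layer = torch.nn.Linear(in_features=2, out_features=2)

weights = layer.weight.numel()  # 2 x 2 = 4 weights
biases = layer.bias.numel()     # 2 biases, one per output
total = sum(p.numel() for p in layer.parameters())

print(weights, biases, total)   # 4 2 6
```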
We used customer data as our teaching example, but here's what's powerful: the exact same matrix operations work for ANY kind of data. The math doesn't care whether we're processing customer information, text, images, or audio.
Different data, identical math. Whether we're processing customer data with 2 features or word embeddings with 768 features, we're doing the same matrix multiplication operation. The network architecture, the training process, the gradient descent—it's all the same fundamental mathematics.
This is why understanding matrix multiplication with simple customer examples gives us the foundation to understand how GPT-4, Claude, DALL-E, and every other neural network actually works!
When these systems go to production, the matrices get much bigger, but the operations remain identical:
The operation is identical—just bigger. A production LLM doing billions of matrix multiplications per second is using the exact same math we just learned with our tiny 3×2 customer matrix. The only differences are scale and speed.
Understanding matrices helps us decode AI vendor pricing and make informed infrastructure decisions. Here's the core insight that can save companies millions:
Every AI pricing model ultimately comes down to matrix-related compute factors, and one insight above all: larger matrices = more compute = higher cost. This single equation explains most AI pricing.
A vendor quotes $5,000/month for GPU instances vs $1,000/month for CPUs to analyze customer feedback.
GPUs parallelize matrix multiplication—processing thousands of calculations simultaneously. Critical for real-time responses (like chatbots), but for overnight batch analysis of 10,000 reviews, CPUs work fine at 20% the cost.
Result: Companies routinely save 60-80% by matching compute type to actual use case.
Core concepts from this chapter:
A matrix is a 2D grid of numbers that organizes data efficiently. Each row typically represents one data point (like a customer), and each column represents one feature (like age or usage hours).
Matrix multiplication lets us process all data points simultaneously. Instead of predicting one customer at a time, we multiply a data matrix by a weight matrix and get predictions for everyone at once.
To multiply matrices A × B, the number of columns in A must equal the number of rows in B. If A is (3×2) and B is (2×1), the result is (3×1). This "shape rule" determines what operations are possible.
The formula y = Wx + b includes bias (b) which shifts predictions. Without bias, all predictions must pass through zero. Bias gives models flexibility to represent real-world relationships.
Neural networks stack matrix operations. A hidden layer transforms inputs into new features that capture complex patterns. Each layer output becomes the next layer's input, building abstraction.
The same matrix multiplication formula works for 3 customers or 3 million, 2 features or 2,000. Modern AI scales by using the same mathematical operations on larger matrices, processed in parallel on GPUs.
The Core Formula:
y = Wx + b
This simple equation—matrix multiplication plus bias—is the foundation of every neural network layer. Stack it, add non-linearity, and we get deep learning.
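A minimal sketch of that stacking—two layers with a ReLU non-linearity between them (layer sizes and random weights are illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)  # the non-linearity between layers

X = np.array([[4.0, 2.0],
              [1.0, 2.0],
              [0.0, 5.0]])   # 3 customers, 2 features

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)  # layer 1: 2 -> 4
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)  # layer 2: 4 -> 1

hidden = relu(X @ W1 + b1)   # y = Wx + b, then the non-linearity
output = hidden @ W2 + b2    # each layer's output feeds the next
print(output.shape)          # (3, 1): one prediction per customer
```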
We've just learned the math behind deep learning—neural networks with weights and biases. But this is just one of four fundamental approaches to building intelligent systems. Understanding all four helps recognize which type of AI powers different applications and why.
Every AI system falls into one (or a combination) of these four categories, depending on how it learns and reasons. Let's explore each one in depth.
The oldest form of AI—systems that follow hand-written logic and rules programmed by humans.
Programmers write explicit rules: if condition then action
No learning occurs. The system only knows what humans explicitly programmed. If the situation wasn't anticipated in the rules, the system can't handle it.
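For example, a hand-written churn rule might look like this (the thresholds are made up for illustration):

```python
def churn_risk(tenure_months, support_tickets):
    # Every branch was written by a human; nothing here is learned from data.
    if tenure_months < 3 and support_tickets > 5:
        return "high risk"
    if support_tickets > 10:
        return "high risk"
    return "low risk"

print(churn_risk(tenure_months=2, support_tickets=7))  # high risk
```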
Bottom line: Great for well-defined, unchanging problems. Not suitable for complex, evolving situations.
Systems that learn patterns from data using mathematical algorithms—the foundation of most practical ML applications today.
Engineers manually define features (decide which data matters), then the algorithm learns patterns from examples.
Key difference from deep learning: Engineers define what features to look at (email length, sender domain, word frequency). The algorithm figures out the weights, but not the features.
Bottom line: The workhorse of production ML. Powers most business applications with structured data and clear features.
Neural networks with multiple layers that learn both features and patterns automatically from raw data.
This is what we've been learning! Multiple layers of matrices (Y = XW + b) stacked together, where each layer's output becomes the next layer's input and each layer learns its own weights and biases.
Key difference from statistical ML: No need to engineer features—the network discovers them automatically during training.
Everything we learned in Chapters 1-6 builds the foundation for this paradigm.
Bottom line: The breakthrough technology behind modern AI. With lots of data and complex patterns (images, language, audio), deep learning dominates.
Learning by trial and error through interaction with an environment, receiving rewards or penalties for actions.
An agent takes actions in an environment, receives rewards (positive feedback) or penalties (negative feedback), and learns which actions lead to the best long-term outcomes.
Key insight: No labeled training data needed—the system learns from consequences of its actions.
Bottom line: Ideal for sequential decision-making where success can be defined but not the exact steps to get there.
| Aspect | Rule-Based | Statistical ML | Deep Learning | Reinforcement |
|---|---|---|---|---|
| Learning? | No | Yes | Yes | Yes |
| Data Needed | None | Hundreds-Thousands | Millions+ | Millions of trials |
| Feature Engineering | N/A | Manual | Automatic | Varies |
| Interpretability | Perfect | Good | Poor | Poor |
| Best For | Simple logic | Structured data | Images, text, audio | Sequential decisions |
In production, successful AI systems rarely use just one approach. They combine multiple paradigms to leverage the strengths of each:
LLM assistants (e.g., ChatGPT): Deep Learning (learns language patterns from text) + Reinforcement Learning (learns to be helpful and safe from human feedback) + Rules (content filtering, safety guardrails)
Self-driving cars: Deep Learning (computer vision for object detection) + Reinforcement Learning (decision-making and path planning) + Rules (safety constraints, traffic laws)
Recommendation systems: Deep Learning (learns product/user embeddings) + Statistical ML (collaborative filtering) + Rules (business logic, inventory constraints)
IT monitoring: Statistical ML (anomaly detection, baseline learning) + Rules (threshold alerts, known error patterns) + sometimes Deep Learning (log analysis, pattern recognition)
Fraud detection: Statistical ML (transaction pattern analysis) + Deep Learning (learning from complex behavioral sequences) + Rules (known fraud patterns, regulatory requirements)
Contact centers: Deep Learning (conversational IVR, speech-to-text, emotion detection) + Statistical ML (call routing, sentiment analysis, quality scoring) + Rules (compliance checks, escalation policies) + sometimes Reinforcement Learning (optimizing scheduling and routing strategies)
Key takeaway: Understanding all four paradigms helps recognize how different components work together and choose the right tool for each part of the problem.
The rest of this course focuses on Deep Learning (Type 3)—the paradigm behind modern breakthroughs in language, vision, and generation.
These fundamentals provide the foundation for understanding how modern AI systems work—from the math to production deployment.