Deep Learning & Modern AI: Chapter 1

Embeddings: Words to Numbers

How AI captures the meaning of language


The Translation Question

We've learned how neural networks process data using matrices and matrix multiplication. But here's the key insight: neural networks only understand numbers, while human language is made of words, sentences, and meaning.

Human Language

cat
democracy
happiness
artificial intelligence

Words carry meaning, context, relationships


How do we translate?

Neural Network Input

[0.234, -0.512, 0.891, 0.123, …]
[0.912, 0.445, -0.223, 0.667, …]
[0.556, -0.334, 0.789, -0.445, …]
[-0.111, 0.634, 0.478, -0.889, …]

Numbers that can be multiplied and transformed

The Central Question

How do we convert the word "cat" into a list of numbers that captures its meaning in a way that neural networks can understand? This is what embeddings solve.

What You'll Learn in This Chapter

We're going deep into embeddings - understanding not just what they are, but how they work, why they work, and how modern AI systems use them. This chapter covers:

1. Vector Spaces: How can geometry represent meaning?
2. Measuring Similarity: The mathematics of semantic closeness
3. Tokenization: Breaking text into processable pieces
4. The Embedding Matrix: Where the magic numbers live
5. Positional Encodings: Teaching models about word order
6. Training & Applications: How embeddings learn and what they enable

Vector Spaces: A Map of Meaning

Building on What You Already Know

Remember vector spaces from Chapter 5? We represented customers as vectors in 2D space:

Customer A = [6 months, 40 hours]
Customer B = [3 months, 10 hours]

Each dimension represented a measurable feature (subscription length, usage). We used vector math to find patterns and make predictions.

The Big Leap with Embeddings

Embeddings use the exact same mathematical principles, but instead of representing customer features, they represent word meanings. Instead of 2 dimensions (months, hours), we have 768+ dimensions that capture semantic relationships.

Let's see how this works...

From Customer Features to Word Meanings

Picture a map where every word has a specific location, and:

  • Words with similar meanings are placed close together
  • Words with different meanings are far apart
  • The distance and direction between words captures their relationship
Picture two regions on this map: an Animals cluster (cat, dog, kitten, puppy) and a Politics cluster (democracy, vote, election). "Cat" and "dog" sit close together; "cat" and "democracy" are far apart.

This "meaning map" is exactly what a vector space is. But what actually makes it a "vector space"? Let's understand the mathematical properties that make this possible.

The Same Vector Space Rules Apply

You already learned the two fundamental rules of vector spaces in Chapter 5:

1. Vector Addition (Closure under addition)
Chapter 5 (Customer Features): [6, 40] + [3, 10] = [9, 50]
Chapter 1 (Word Meanings): embedding("king") + embedding("woman")

2. Scalar Multiplication (Closure under scaling)
Chapter 5 (Customer Features): 2 × [6, 40] = [12, 80]
Chapter 1 (Word Meanings): 0.5 × embedding("intense")

What's Different with Embeddings?

The mathematical rules are identical. What changes is what the vectors represent:

  • Customer vectors: Each dimension = one explicit feature (months, hours, logins)
  • Word embeddings: Each dimension = one learned semantic trait (formality, sentiment, abstractness)

Why This Matters for Language

Because embeddings live in a vector space with these same mathematical properties, AI models can:

  • Measure semantic similarity between words (using dot product and cosine similarity you learned in Chapter 5)
  • Combine word meanings algebraically: "king" - "man" + "woman" ≈ "queen" (see the sketch after this list)
  • Learn patterns through gradient descent (just like classification models)
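
Here is a tiny NumPy sketch of that arithmetic. The 3-dimensional vectors are invented for illustration; real embeddings have hundreds of learned dimensions:

import numpy as np

# Toy 3-dimensional "embeddings" with made-up values.
king  = np.array([0.80, 0.65, 0.10])
man   = np.array([0.75, 0.20, 0.05])
woman = np.array([0.70, 0.20, 0.85])
queen = np.array([0.78, 0.63, 0.88])

# Vector arithmetic: king - man + woman should land near queen.
result = king - man + woman

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(result)                 # ≈ [0.75, 0.65, 0.90]
print(cosine(result, queen))  # ≈ 0.999 for these toy numbers: very close to "queen"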

From 2D to High-Dimensional Spaces

Our map visualization is 2-dimensional (left-right, up-down). But to capture the full richness of language, we need many more dimensions.

2D: 2 numbers

Can capture some relationships, but very limited

3D: 3 numbers

Better, but still insufficient for language

"""""
"""""
"""""
High-D: Hundreds to Thousands of numbers

What modern AI models actually use

Real Embedding Dimensions Across Models (2025)

Quick context: Embeddings are vectors (lists of numbers) that represent word meanings in AI models. The "dimension" is how many numbers are in each vector — more numbers means more capacity to capture subtle meanings.

Different AI models use different embedding dimensions. Larger dimensions can capture more nuanced meanings, but require more computational resources. Here's what major models use:

  • Cohere embed-v3 (2023): 1,024 dimensions
  • Voyage AI voyage-2 (2024): 1,024 dimensions
  • OpenAI text-embedding-3-small (2024): 1,536 dimensions
  • OpenAI text-embedding-3-large (2024): 3,072 dimensions
  • Google Gemini Embeddings (2024): 3,072 dimensions
Key Insights
✓ Smaller models (Cohere, Voyage): 1,024 dimensions — efficient and fast
✓ Mid-size models (OpenAI small): 1,536 dimensions — balanced performance
✓ Large models (OpenAI large, Gemini): 3,072 dimensions — capture subtle nuances

More dimensions = more capacity to capture subtle semantic nuances, but also more computational cost

📝 Note About Examples

Throughout this chapter, our examples use 768 or 1,024 dimensions, sizes that are small enough to reason about while still being realistic. Just remember that production embedding models typically use anywhere from 1,024 to 3,072 dimensions depending on their use case!

Organizing a Music Library

Imagine organizing thousands of songs so similar ones are easy to find. You need to convert each song into numbers that capture its characteristics. Here's the challenge:

Understanding Multi-Dimensional Characteristics

Think about these three songs:

Fast Rock: High tempo, Rock genre, Electric guitars
Slow Rock: Low tempo, Rock genre, Acoustic guitars
Slow Classical: Low tempo, Classical genre, Orchestra

❓ Question: Can you represent each song with ONE number that captures similarity?

Why ONE Number Doesn't Work

Let's try placing each song on a number line (0.0 to 1.0):

Slow Classical: 0.2 (low tempo)
Slow Rock: ??? (low tempo like Classical, but rock genre like Fast Rock — where does it go?)
Fast Rock: 0.8 (high tempo)

⚠️ The Fundamental Limitation

Slow Rock is similar to Slow Classical in tempo (both slow),
BUT similar to Fast Rock in genre (both rock).

One number forces you to choose: sort by tempo OR genre, not both!

Songs vary across MANY independent dimensions: tempo, genre, mood, instruments, vocals, era, energy...
One number can't capture multiple independent characteristics simultaneously!

Use MULTIPLE Numbers (A Vector)

Instead of one number, represent each song as a vector — a list of numbers where each dimension captures a different characteristic:

Fast Rock Song: [0.89, 0.12, 0.78, 0.34, ... 764 more]
0.89 = High rock-ness
0.12 = Low classical-ness
0.78 = High tempo/energy
0.34 = Medium vocal intensity
• ... (768 dimensions total, each capturing different aspects)
Slow Rock Song: [0.91, 0.09, 0.32, 0.67, ... 764 more]
0.91 = High rock-ness ✓ (similar!)
0.09 = Low classical-ness ✓ (similar!)
0.32 = LOW tempo/energy ✗ (different!)
0.67 = High vocal intensity ✗ (different!)

Now the computer can calculate similarity! These two songs have similar values for dimensions 1 & 2 (both rock), but different values for dimensions 3 & 4 (tempo/vocals). The computer computes the mathematical distance between the full 768-dimensional vectors.

"Rock Ballad" (Slow, Guitar-heavy) [0.234, -0.512, 0.891, ...] 768 characteristics "Soft Rock" (Slow, Guitar-heavy) [0.241, -0.505, 0.887, ...] 768 characteristics "EDM Dance" (Fast, Electronic) [-0.678, 0.912, 0.234, ...] 768 characteristics Let's zoom into just the FIRST characteristic (Tempo: Slow → Fast): -1.0 Slow 0.0 Medium +1.0 Fast Rock Ballad 0.234 Soft Rock 0.241 Nearly identical! EDM Dance -0.678 Far apart ✓ Similar songs have similar numbers across all 768 characteristics!

The numbers directly represent musical properties. Both rock ballads have similar tempo values (0.234 and 0.241) — they're practically identical! The EDM track (-0.678) is clearly different. Now when you search for similar songs, the computer can use these numbers to find matches. This is exactly how embeddings work for words!
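
To make "computes the mathematical distance" concrete, here is a tiny NumPy sketch using just the first three of the 768 characteristics above (the values are the toy numbers from this example, not real model output):

import numpy as np

rock_ballad = np.array([0.234, -0.512, 0.891])
soft_rock   = np.array([0.241, -0.505, 0.887])
edm_dance   = np.array([-0.678, 0.912, 0.234])

# Euclidean distance: a small distance means similar songs.
print(np.linalg.norm(rock_ballad - soft_rock))  # ≈ 0.01  (very similar)
print(np.linalg.norm(rock_ballad - edm_dance))  # ≈ 1.81  (very different)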

Why Continuous Numbers?

You've seen that embeddings are vectors: [0.234, -0.512, 0.891]. But why use decimal numbers? Why not simpler integers like [1, 2, 3]?

After all, [1, 2, 3] is cleaner, easier to store, and faster to process. What's wrong with discrete values?

Understanding Gradient-Based Learning: Why Discrete Numbers Don't Work

Imagine you're training a model. It predicts "cat" when the correct answer is "dog". The model needs to adjust the embedding for "cat" to make it closer to "dog".

With discrete numbers [1, 2, 3]:

Current embedding: [1, 2, 3]
Model asks: "Should I change 1 to 2? Or 2 to 3?"
Constraint: You can only jump. No in-between values.

The learning algorithm calculates a gradient — a direction that says "move this number up by 0.03" or "move that number down by 0.17". But with discrete numbers, you can't move by 0.03. You can only jump by 1.

The Limitation: The model can't make small corrections. Every change is a big leap. Learning either overshoots or gets stuck. Gradients require smooth, continuous values to function effectively.
How Continuous Numbers Enable Precise Adjustments

With continuous numbers, the model can move in tiny steps:

With continuous numbers [0.234, -0.512, 0.891]:

Iteration 1: [0.234, -0.512, 0.891]
→ Adjust by gradient: +0.003, -0.017, +0.005
Iteration 2: [0.237, -0.529, 0.896]
→ Adjust by gradient: +0.002, -0.011, +0.003
Iteration 3: [0.239, -0.540, 0.899]
→ Continue tiny adjustments...

Each adjustment is small and controlled. The model follows the gradient like a ball rolling downhill, gradually finding better representations. This is how neural networks learn — through millions of tiny gradient-based updates.

Analogy: Discrete numbers = climbing stairs (only specific heights). Continuous numbers = walking on a ramp (any height). Gradient descent needs the ramp.
The Mathematical Reason

Neural network training uses calculus. Specifically, it computes derivatives (rates of change). Derivatives only exist for continuous functions.

The gradient formula asks: "If I change x by an infinitesimal amount, how much does the error change?"

With discrete x: There is no infinitesimal change. Derivative is undefined or zero everywhere.
With continuous x: You can compute exact slopes. Gradient descent works.

This is why all learnable parameters in neural networks (weights, biases, embeddings) use floating-point numbers. Learning is optimization through gradients, and gradients require continuity.

Continuous numbers aren't a choice — they're a requirement for learning.
Without smooth values, there are no gradients. Without gradients, there is no training.
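
Here is a minimal Python sketch of that idea: one toy embedding is nudged toward a target by repeated gradient steps (the vectors, loss, and learning rate are invented for illustration):

import numpy as np

# Nudge the "cat" embedding toward the "dog" embedding by gradient descent
# on a squared-distance loss.
cat = np.array([0.234, -0.512, 0.891])
dog = np.array([0.267, -0.489, 0.823])
learning_rate = 0.1

for step in range(3):
    loss = np.sum((cat - dog) ** 2)       # how far apart the vectors are
    gradient = 2 * (cat - dog)            # derivative of the loss w.r.t. cat
    cat = cat - learning_rate * gradient  # a small, continuous adjustment
    print(step, round(loss, 6), cat)

# Each update moves "cat" slightly closer to "dog" - exactly the kind of tiny,
# continuous correction that integer-only embeddings could never make.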

How AI Actually Learns Embeddings

You don't manually assign these numbers. Instead, AI trains on millions of examples (e.g., "users who liked Song A also liked Song B") and automatically learns what numbers to assign to make similar items end up close together in vector space.

About the Music Example:

This example uses interpretable dimensions ("tempo", "guitar-ness") for teaching purposes. Real embeddings have latent features learned by AI — you can't point to dimension #347 and say "this represents pluralization." The patterns are complex and distributed across many dimensions. The music analogy helps understand the concept, but actual embeddings are learned representations, not hand-crafted features.

Because these are numbers, not arbitrary labels, the computer can use mathematical formulas to calculate how similar two items are. Looking at [0.234, -0.512, 0.891] and [0.241, -0.505, 0.887], the computer can measure: "These numbers are almost the same, so these songs must be similar!"

We'll explore the exact mathematical formulas for measuring similarity in the next section!

Why Negative Numbers Are Essential

You might wonder: why do we need negative numbers? Why not just use 0 to 1? The answer: direction matters for capturing opposite meanings.

Imagine a "Temperature" Dimension:

Picture a temperature dimension running from -1.0 (COLD) through 0.0 (NEUTRAL) to +1.0 (HOT): frozen (-0.82), cold (-0.52), cool (-0.23), room (0.03), warm (0.28), hot (0.67), scorching (0.91). The negative and positive directions capture opposite meanings, which is why embeddings use values from -1 to +1, not just 0 to 1.

The Three Zones:

  • Negative values (-1.0 to 0.0): Represent one side of a spectrum (cold, negative, past, etc.)
  • Zero (0.0): Neutral - this dimension doesn't apply to this word
  • Positive values (0.0 to +1.0): Represent the opposite side (hot, positive, future, etc.)

Real example: Across 768 dimensions, each dimension might capture a different semantic aspect: temperature, emotion, time, size, formality, concreteness, etc. Using negative and positive values lets the model represent rich, nuanced relationships.

Technical Note: Value Ranges

The -1 to +1 scale shown here is for illustration. In practice, raw embedding values are unbounded — they can be any positive or negative number. However, many models apply normalization (like L2 normalization) which scales the entire vector so its length equals 1, bringing individual values roughly into the -1 to +1 range. This normalization helps with:

  • Faster similarity computations (cosine similarity becomes a simple dot product)
  • Preventing some dimensions from dominating others
  • Better numerical stability during training

So while we use -1 to +1 for teaching, real embeddings before normalization can have values like -47.3 or +12.8!
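
As a concrete illustration, here is a small NumPy sketch of L2 normalization (the raw values are invented):

import numpy as np

# Raw (unnormalized) embedding values can be any size.
raw = np.array([-47.3, 12.8, 3.1, -0.4])

# L2 normalization: divide by the vector's length so the result has length 1.
unit = raw / np.linalg.norm(raw)
print(np.linalg.norm(unit))   # 1.0
print(unit)                   # individual values now fall roughly within [-1, 1]

# For unit-length vectors, cosine similarity reduces to a plain dot product.
other = np.array([0.1, -0.2, 0.9, 0.4])
other_unit = other / np.linalg.norm(other)
print(unit @ other_unit)      # equals the cosine similarity of raw and other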

What Does an Embedding Actually Look Like?

Now that we understand continuous vectors, let's look at real embeddings. An embedding is simply a list of continuous numbers - a vector:

Word: "cat" 768 dimensions
[
0.234, -0.512, 0.891, 0.123, -0.334, 0.667, -0.111, 0.445, … (760 more)
]

Each number typically between -1 and 1

Word: "dog" 768 dimensions
[
0.267, -0.489, 0.823, 0.156, -0.301, 0.634, -0.145, 0.478, … (760 more)
]

Notice: Similar numbers! Similar meanings → similar vectors

Word: "democracy" 768 dimensions
[
-0.678, 0.912, 0.234, -0.445, 0.789, -0.223, 0.556, -0.889, … (760 more)
]

Completely different numbers = different meaning

Picture a simple 2D plot of word vectors, with one dimension capturing something like animality and the other something like size, and words such as cat, dog, kitten, democracy, and vote placed on it as points.
Observation:

Similar words (cat, dog, kitten) cluster together!

Different concepts (animals vs politics) stay far apart.

This is how AI understands meaning through geometry.

Measuring Similarity: The Mathematics

How Do We Measure "Close"?

We said "cat" and "dog" are "close" in vector space. But what does that mean mathematically? We need a precise way to measure similarity between vectors.

A Geometry Analogy

Imagine two people standing in a room. How do we measure if their positions are "similar"?

Option 1: Distance

How far apart are they? (Euclidean distance)

Option 2: Direction

Are they in the same direction from the origin? (Angle/Cosine)

Both work, but for embeddings we typically care about direction more than absolute distance.

Cosine Similarity: The Standard Metric

The most common way to measure similarity is cosine similarity. It measures the angle between two vectors.

Small angle = Similar words

Large angle = Different words

What You Need to Know

Cosine similarity gives you a score from -1 to 1:
1 = Vectors point same direction (very similar meanings)
0 = Vectors perpendicular (unrelated)
-1 = Vectors point opposite directions (opposite meanings)

The math handles the complexity — you just need to know higher scores = more similar!

🧮 Deep Dive (Optional): See the Math Formula

The Cosine Similarity Formula

cos(θ) = (A ⋅ B) / (||A|| ⋅ ||B||)

A ⋅ B
Dot Product: Multiply corresponding elements and sum
= a₁b₁ + a₂b₂ + … + aₙbₙ

||A||
Magnitude of A: Length of vector A
= √(a₁² + a₂² + … + aₙ²)

||B||
Magnitude of B: Length of vector B
= √(b₁² + b₂² + … + bₙ²)

Result range: -1 to 1

  • 1 = Identical direction (very similar)
  • 0 = Perpendicular (unrelated)
  • -1 = Opposite direction (opposite meaning)

Worked Example: Cosine Similarity Step by Step

Vector A = [0.8, 0.4, 0.6]
Vector B = [0.7, 0.5, 0.5]

Step 1: Dot Product (A ⋅ B)
(0.8 × 0.7) + (0.4 × 0.5) + (0.6 × 0.5) = 0.56 + 0.20 + 0.30 = 1.06

Step 2: Magnitude of A (||A||)
√(0.8² + 0.4² + 0.6²) = √(0.64 + 0.16 + 0.36) = √1.16 = 1.077

Step 3: Magnitude of B (||B||)
√(0.7² + 0.5² + 0.5²) = √(0.49 + 0.25 + 0.25) = √0.99 = 0.995

Final: Cosine Similarity
cos(θ) = 1.06 / (1.077 × 0.995) = 1.06 / 1.072 = 0.989

On the similarity scale from -1 (opposite) through 0 (unrelated) to 1 (identical), a score of 0.989 means these two vectors point in nearly the same direction: very similar.
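
The same calculation takes only a few lines of Python. This sketch uses NumPy and the toy vectors from the worked example above:

import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

A = np.array([0.8, 0.4, 0.6])
B = np.array([0.7, 0.5, 0.5])
print(cosine_similarity(A, B))   # ≈ 0.989: very similar directions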

Key Takeaway

Cosine similarity compares vector directions to measure semantic similarity.
When words have similar meanings, their embeddings point in similar directions, producing cosine scores close to 1.

Common Applications
Semantic Search: Find documents by meaning, not keywords. Customer asks "How do I get my money back?" — system finds "Refund Policy" even though words don't match.
RAG Retrieval: LLMs use cosine similarity to fetch relevant context from your knowledge base before generating answers.
AI Agent Tool Selection: When users request an action, agents use cosine similarity to pick the right tool from hundreds of options.

Cosine similarity is widely used in semantic search, retrieval systems, and recommendation engines across the AI industry.

From Text to Vectors: The Complete Pipeline

The Big Picture

From Text to Numbers: The Complete Journey

As we explored in Chapter 1, AI finds patterns in data through numerical processing. To leverage this capability for language, we transform text into vectors through a series of steps:

Text: "playing" → Tokens: ["play", "ing"] → IDs: [42, 87] → Vectors: [[0.23, ...], [0.15, ...]]

Let's understand each step of this pipeline, starting with: How do you break text into pieces?

Step 1: Tokenization — Breaking Text Into Pieces

The Test Sentence

Let's use this sentence throughout to see how different approaches work:

"The players are replaying the game"

Three Approaches

Attempt 1
Make the Dictionary Like a Keyboard

The Idea: Just like a keyboard has ~100 keys (letters, numbers, punctuation) but you can type anything, what if our dictionary just stored individual characters? Then we could "spell out" any word!

Dictionary = { 'a', 'b', 'c', ..., 'z', 'A', 'B', ..., 'Z', ' ', '.', ... } → ~100 keys ⌨️
Sentence gets "typed" letter-by-letter:
['T', 'h', 'e', ' ', 'p', 'l', 'a', 'y', 'e', 'r', 's', ' ', 'a', 'r', 'e', ' ', 'r', 'e', 'p', 'l', 'a', 'y', 'i', 'n', 'g', ' ', 't', 'h', 'e', ' ', 'g', 'a', 'm', 'e']
→ 34 keystrokes!
Pro: Small dictionary (~100 keys), can spell anything
Con: Way too many "keystrokes", individual letters carry no meaning
Attempt 2
Store Every Word in the Language

The Idea: Instead of individual letters, what if we stored every complete word? "play", "playing", "player", "players", "replay", "replaying" — each gets its own spot in the dictionary.

Dictionary = { 'The', 'players', 'are', 'replaying', 'the', 'game', ..., 'play', 'playing', 'player', ... }
→ Need ~500,000 entries to cover English!
Our sentence: ['The', 'players', 'are', 'replaying', 'the', 'game']
→ Just 6 tokens! Much shorter.
Pro: Very short, preserves word meaning perfectly
Con: Dictionary must contain EVERY word in the language. Can't handle new words (like "ChatGPT")

Analogy: Like a keyboard with 500,000 keys — one for every word. Fast, but impractical and can't type new words!

Wait... What if There's a Better Way?

Attempt 1 was too small. Attempt 2 was too big. Let's look closely at some words and see if we notice anything interesting...

Let's Break Down Some Related Words
Word: playing
play + ing
Word: player
play + er
Word: replaying
re + play + ing
Word: players
play + er + s

Do you see it? The piece play keeps showing up!
Same with ing, er, re...

Instead of storing 500K whole words, what if we just stored these common pieces?

Attempt 3 ✓
Store Common Pieces (Subwords)
Dictionary = { 'The', 'play', 'er', 's', 'are', 're', 'ing', 'the', 'game', ... } → a full subword vocabulary needs only ~50K entries
Sentence becomes: ['The', 'play', 'er', 's', 'are', 're', 'play', 'ing', 'the', 'game']
→ 10 tokens
Perfect Balance: Medium dictionary, reasonable token count, handles "play", "player", "players", "playing", "replay", "replaying" — ALL with the same pieces!

This is Byte-Pair Encoding (BPE)

BPE automatically discovers these frequent patterns by analyzing millions of sentences.
It's the same pattern recognition from Chapter 1 — but applied to finding reusable text pieces.

How BPE Actually Works

BPE automatically discovers these common pieces by analyzing text and repeatedly merging the most frequent adjacent character pairs.

The Algorithm (Simplified)
Start: p-l-a-y, p-l-a-y-i-n-g, p-l-a-y-e-d, p-l-a-y-e-r
Step 1: Find most frequent pair (p, l) → merge into pl
Step 2: Find most frequent pair (pl, a) → merge into pla
Step 3: Find most frequent pair (pla, y) → merge into play
...repeat until vocabulary reaches desired size

Result: A vocabulary of frequently-used pieces that can combine to form any word, including ones never seen during training.

Who decides "desired size"? The model designers choose vocabulary size during training based on tradeoffs: smaller vocabularies (32K-50K tokens) are faster but produce longer sequences, while larger ones (100K-256K tokens) are slower but handle multilingual text more efficiently. Most modern LLMs use 32K-50K for English-focused tasks or 100K-256K for multilingual support.

For the Curious: Complete Mathematical Formulation
1. Initialization

Start with a training corpus C and an initial vocabulary V₀ containing all unique characters:

V₀ = { c | c ∈ unique characters in C }

Example:
C = ["low", "lower", "newest", "widest"]
V₀ = { 'l', 'o', 'w', 'e', 'r', 'n', 's', 't', 'd', 'i' }

Each word is split into individual characters: "low" → ['l', 'o', 'w']

2. Frequency Counting

For each iteration i, count all adjacent symbol pairs in the corpus:

freq(x, y) = Σ count of adjacent pair (x, y) across all words in C

Example (iteration 1):
Words: ['l','o','w'], ['l','o','w','e','r'], ['n','e','w','e','s','t'], ['w','i','d','e','s','t']

Pair frequencies:
freq('l', 'o') = 2 ← appears in "low" and "lower"
freq('o', 'w') = 2 ← appears in "low" and "lower"
freq('w', 'e') = 2 ← appears in "lower" and "newest"
freq('e', 's') = 2 ← appears in "newest" and "widest"
freq('s', 't') = 2 ← appears in "newest" and "widest"
freq('w', 'i') = 1
freq('i', 'd') = 1
freq('d', 'e') = 1
...
3. Merge Operation

Select the most frequent pair and merge it into a new symbol:

(x*, y*) = argmax(x,y) freq(x, y)

new_symbol = x* + y* (concatenation)
Vi+1 = Vi ∪ { new_symbol }

Replace all occurrences of (x*, y*) in corpus with new_symbol

Example:
Most frequent: ('e', 's') with frequency 2
Create new symbol: 'es'
V₁ = V₀ ∪ { 'es' }

Updated words:
['n','e','w','e','s','t'] → ['n','e','w','es','t']
['w','i','d','e','s','t'] → ['w','i','d','es','t']
4. Iteration

Repeat steps 2-3 until vocabulary reaches desired size |V| = k:

for i = 0 to k - |V₀| do:
  1. Count all adjacent pairs → freq(x, y) for all x, y
  2. Find most frequent: (x*, y*) = argmax freq(x, y)
  3. Create new symbol: s = x* + y*
  4. Add to vocabulary: Vi+1 = Vi ∪ { s }
  5. Replace all (x*, y*) with s in corpus
end for

Stopping condition:
|Vfinal| = k (typically k = 32,000 to 128,000)
5. Tokenization (Encoding New Text)

To tokenize a new word, apply learned merges in the same order:

Input: word w, learned merge operations M = [(x₁, y₁), (x₂, y₂), ..., (x_k, y_k)]

1. Split word into characters: w = [c₁, c₂, ..., c_n]
2. For each merge (x_i, y_i) in M (in order):
   Replace all adjacent (x_i, y_i) with (x_i + y_i)
3. Return final sequence of tokens

Example:
Word: "lowest"
Start: ['l', 'o', 'w', 'e', 's', 't']
Apply merge₁ ('e', 's') → 'es': ['l', 'o', 'w', 'es', 't']
Apply merge₂ ('es', 't') → 'est': ['l', 'o', 'w', 'est']
Apply merge₃ ('l', 'o') → 'lo': ['lo', 'w', 'est']
Apply merge₄ ('lo', 'w') → 'low': ['low', 'est']

Final tokens: ['low', 'est']
6. Computational Complexity
Training:
- Per iteration: O(N) to count pairs and apply merges
  where N = total number of symbols in corpus
- Total iterations: k - |V₀| (usually ~50K merges)
- Overall: O(k × N)

Tokenization (inference):
- Worst case: O(n²) where n = word length
- Typical: O(n × log k) with optimized data structures
Key Mathematical Properties
  • Greedy algorithm: Always merges most frequent pair (no backtracking)
  • Deterministic: Same corpus + same k → same vocabulary
  • Order-dependent: Merge sequence matters for tokenization
  • Lossless: Can always reconstruct original text from tokens
  • Vocabulary size control: Exact control via stopping at k merges
  • OOV handling: No "unknown" tokens (can fall back to characters)
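
To make the algorithm concrete, here is a minimal Python sketch of the training and encoding procedure described above. It is a toy implementation for the four-word corpus, not production tokenizer code:

from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def train_bpe(words, num_merges):
    """Learn a list of BPE merges from a tiny corpus."""
    corpus = [list(w) for w in words]          # each word as a list of characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()                      # 1. count adjacent symbol pairs
        for symbols in corpus:
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)       # 2. pick the most frequent pair
        merges.append(best)
        corpus = [merge_pair(s, best) for s in corpus]   # 3. apply the merge everywhere
    return merges

def tokenize(word, merges):
    """Apply the learned merges, in order, to a new word."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

merges = train_bpe(["low", "lower", "newest", "widest"], num_merges=4)
print(tokenize("lowest", merges))   # ['low', 'est'] (exact merges depend on tie-breaking)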
Real Models Today
  • 50K-100K: typical vocabulary size
  • ~2-3: average pieces per word (English)
  • Billions: possible word combinations

Modern LLMs like GPT-4, Claude, and others use vocabularies in this range — small enough to be efficient, large enough to be expressive.

Want to explore tokenizers yourself?

OpenAI's tiktoken — GPT models' BPE tokenizer (Python library; see the example after this list)
Tiktokenizer — Interactive web tool to see how text gets tokenized
SentencePiece — Google's language-agnostic tokenizer library
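
For example, here is roughly what exploring a tokenizer with tiktoken looks like (assuming the library is installed with pip install tiktoken; the exact splits and IDs depend on which encoding you load):

import tiktoken

# Load the BPE vocabulary used by recent OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("The players are replaying the game")
print(ids)                               # a list of integer token IDs
print([enc.decode([i]) for i in ids])    # the text piece behind each ID
print(enc.decode(ids))                   # round-trips back to the original sentence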

Step 2: From Tokens to Vectors

Now that we have tokens ["play", "ing"], we need to convert them to numbers.

How Tokens Get Their Numbers

When the tokenizer is trained using BPE (as described above), each discovered token is assigned a sequential integer ID as it's added to the vocabulary:

During Training:
1. Start with individual characters → assign IDs 0-255
2. BPE finds "pl" is frequent → assign ID 256
3. BPE finds "ay" is frequent → assign ID 257
4. BPE merges to find "play" → assign ID 258
5. Continue until vocabulary reaches desired size

The ID number simply reflects when that token was added to the vocabulary during training. "play" might get ID 258 if it was the 259th token discovered (starting from 0).

Final Vocabulary:
"p" → 112
"l" → 108
"a" → 97
"y" → 121
"pl" → 256
"play" → 258
"ing" → 412
(32,000 tokens total)

Once trained, this vocabulary is fixed. When you type "play", the tokenizer looks it up and finds ID 258.

Stage 2A: ID Mapping

Each token gets a unique integer ID from the vocabulary:

"play" → ID: 42
"ing" → ID: 87
Stage 2B: Embeddings

As we saw with the music library analogy, a single number cannot capture semantic similarity. Instead, each token receives a vector of numbers (typically 256-4096 dimensions), where similar words naturally receive similar vectors.

"cat" → [0.82, -0.31, 0.67, ... 0.15] ← Animal, domestic, pet
"dog" → [0.79, -0.28, 0.71, ... 0.19] ← Similar!
"king" → [-0.15, 0.91, -0.42, ... 0.88] ← Very different!
How Embeddings Learn Meaning

Embeddings learn through the same pattern recognition process covered in Chapter 1 — the model adjusts embedding weights to minimize errors using gradient descent and loss functions (introduced in Chapter 2).

The Distributional Hypothesis: Words that appear in similar contexts have similar meanings.

If the model sees:
• "The cat sat on the mat"
• "The dog sat on the mat"

The training process pushes the embeddings for "cat" and "dog" closer together because they're used in similar ways. After seeing millions of sentences, embeddings naturally cluster by meaning.
The Learning Loop:
1. Model makes a prediction using current embeddings
2. Calculate loss — measure how wrong the prediction was
3. Compute gradients — which direction improves each embedding?
4. Update embeddings using gradient descent
5. Repeat millions of times across training data

Want to fine-tune pre-trained models for your specific needs? See Chapter 5: Fine-tuning & Model Adaptation.

The Embedding Matrix & How Lookup Works

The Storage Challenge

A model with 50,000 tokens, each represented by a 1,024-dimensional vector, needs to store and retrieve 51.2 million numbers. The model must access these embeddings thousands of times per second during inference.

The question: How do we organize these embeddings so the model can instantly retrieve the vector for any token ID?

The Embedding Matrix: Think of a Giant Spreadsheet

All token embeddings are stored in a single two-dimensional matrix called the embedding matrix. The simplest way to understand it: imagine a giant Excel spreadsheet.

The Spreadsheet Structure
Rows: Each row is one token (50,000 rows for a 50K vocabulary)
Columns: Each column is one dimension (1,024 columns for 1,024-dimensional embeddings)
Cells: Each cell contains one number from the embedding vector
Token ID │ Dim 0 Dim 1 Dim 2 Dim 3 ... Dim 1023
Row 42 │ 0.23 -0.87 0.45 0.34 ... 0.12 ← "play"
Row 87 │ 0.15 -0.23 0.34 -0.12 ... 0.08 ← "ing"
Row 512 │ 0.82 -0.31 0.67 0.21 ... 0.15 ← "cat"
...
Row 49999│ -0.45 0.67 -0.23 0.89 ... -0.34 ← last token

Just like finding a value in Excel: if you want the embedding for "cat" (token ID 512), jump directly to row 512 and grab all 1,024 numbers across that row.

Real-World Example: Processing "playing cats"
Step 1: Tokenizer splits text → ["play", "ing", "cat", "s"]
Step 2: Look up each token in vocabulary dictionary:
  • "play" → ID 42
  • "ing" → ID 87
  • "cat" → ID 512
  • "s" → ID 23
Step 3: Jump to those rows in the embedding matrix:
  • Row 42 → [0.23, -0.87, 0.45, ..., 0.12]
  • Row 87 → [0.15, -0.23, 0.34, ..., 0.08]
  • Row 512 → [0.82, -0.31, 0.67, ..., 0.15]
  • Row 23 → [0.11, 0.05, -0.12, ..., 0.03]
Step 4: Model processes these vectors through neural network layers
Why This Design Enables Real-Time Speed

The spreadsheet analogy reveals why LLMs are so fast. When you need the embedding for "cat" (token ID 512):

❌ Slow approach: Search through all 50,000 rows to find "cat"
✓ Fast approach: Jump directly to row 512

This is like using Excel's INDEX with a known row number instead of searching with VLOOKUP: you don't scan any rows, you jump straight to row 512. Whether the vocabulary has 50,000 tokens or 1 million tokens, accessing a row takes the same constant time.

The Key Insight:

When ChatGPT processes your 50-word message, it performs 50 row lookups in this embedding matrix. With direct row access, this happens in milliseconds. If it had to search through rows sequentially, each lookup would slow down proportionally to vocabulary size — turning instant responses into multi-second delays.

🔍 Technical Deep Dive (Optional): The Mathematics of Embedding Lookup
The Mathematical Equivalence

Theoretically, embedding lookup is a linear transformation. For vocabulary size V and embedding dimension d, we have an embedding matrix E ∈ ℝ^(V×d).

To retrieve the embedding for token with index i, we could use a one-hot vector:

One-hot approach:
xonehot = [0, 0, ..., 1, ..., 0] ∈ ℝV (where xi = 1, all others = 0)

embedding = xonehot · E
embedding ∈ ℝd = E[i, :] (row i of matrix E)

When you multiply a one-hot vector by the embedding matrix, the result is mathematically equivalent to extracting the i-th row. This is because all terms vanish except where xi = 1.
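
A quick NumPy sketch makes the equivalence easy to verify (toy sizes, random values):

import numpy as np

# Tiny embedding matrix: vocabulary of 5 tokens, 3 dimensions each.
E = np.random.rand(5, 3)
i = 2                                   # the token ID we want to look up

one_hot = np.zeros(5)
one_hot[i] = 1.0

via_matmul   = one_hot @ E              # one-hot multiplication: 5 × 3 multiplies
via_indexing = E[i]                     # direct row access: no multiplies at all

print(np.allclose(via_matmul, via_indexing))   # True: the two results are identical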

Why Not Implement It This Way?

Forward pass computational cost:

One-hot multiplication: x_onehot · E requires V × d multiplications
For V = 50,000 and d = 1,024:
51.2 million multiplications, where 51.199 million are "multiply by 0"

Memory overhead:

One-hot vector: 50,000 values (float32 = 200 KB per token)
Batch of 512 tokens with one-hot: ~102 MB
Batch of 512 token indices (int32): 2 KB
51,000× memory reduction

In production systems processing thousands of tokens per second, this memory overhead becomes prohibitive. NLP vocabularies can reach 100K-1M tokens, making one-hot encoding impractical.

The Efficient Implementation: Direct Indexing

Modern frameworks recognize this inefficiency and implement embedding lookup as direct array access:

Direct indexing:
embedding = E[i] # Direct row access, O(1) operation

# Framework implementations:
PyTorch: nn.Embedding(vocab_size, embed_dim)
TensorFlow: tf.nn.embedding_lookup(E, indices)
JAX: E[indices] # Pure array indexing

This operation has O(1) time complexity — constant time regardless of vocabulary size — and zero computational waste.
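
For instance, a minimal PyTorch sketch of this lookup (the vocabulary size, dimension, and token IDs are the illustrative values used in this chapter):

import torch
import torch.nn as nn

vocab_size, embed_dim = 50_000, 1_024
embedding = nn.Embedding(vocab_size, embed_dim)   # the embedding matrix E, 50,000 × 1,024

token_ids = torch.tensor([42, 87, 512, 23])       # "play", "ing", "cat", "s"
vectors = embedding(token_ids)                    # direct row lookups, no matrix multiply
print(vectors.shape)                              # torch.Size([4, 1024])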

Backpropagation: Gradient Flow

During training, gradients flow back through the embedding layer. The key insight: only the accessed row receives gradients.

Forward: embedding = E[i]
Backward: ∂L/∂E[i] = ∂L/∂embedding

All other rows: ∂L/∂E[j] = 0 (for j ≠ i)

This sparse gradient structure is why embedding layers are efficient to train. In a batch of 512 tokens from a 50K vocabulary, only ~512 rows receive gradient updates (assuming unique tokens), while 49,488 rows require no computation.

Advanced frameworks like PyTorch implement sparse gradient updates to further optimize this — gradients are stored as sparse tensors, updating only the non-zero entries rather than the full matrix.

Practical Impact

The mathematical equivalence means embeddings are theoretically a linear transformation, but the implementation efficiency makes modern NLP possible. Without direct indexing:

  • GPT-4's inference would slow from milliseconds to minutes
  • Training runs would require 50× more GPU memory
  • Real-time translation and autocomplete would be impossible

Key Takeaway

The Core Insight

AI doesn't memorize every word.
Instead, it discovers frequent patterns (like "play", "ing", "er") and combines them like building blocks.
With ~100K token pieces, it can understand billions of word combinations.

The Complete Pipeline
Text: "playing cats" (raw input text)
↓
Tokens: BPE segmentation → ["play", "ing", "cat", "s"] (reusable pieces)
↓
IDs: Vocabulary lookup → [42, 87, 512, 23] (unique identifiers)
↓
Embeddings: Matrix lookup → dense vectors (numbers that capture meaning)
↓
Vector Space: Semantic representation → similar meanings cluster together

This is how AI bridges the gap between human language and mathematical computation.

Tokenization impacts model cost, speed, and fairness across languages. The embedding matrix enables fast, constant-time lookups. Together, they convert billions of word combinations into a space where AI can find patterns and understand meaning.

Positional Encodings: Teaching Models About Order

Position Matters in Language

We've built an amazing system: text becomes tokens, tokens become embeddings, embeddings capture meaning. But there's a critical problem we haven't addressed yet.

Consider these two sentences:

Sentence 1:

"The dog chased the cat"

Clear: dog is doing the chasing

Sentence 2:

"The cat chased the dog"

Clear: cat is doing the chasing

Wait... Stop and Think

Both sentences use the exact same tokens: ["the", "dog", "chased", "the", "cat"]. Just in different order.

If we look up embeddings for these tokens, we get:
• "the" → [0.12, -0.45, ...]
• "dog" → [0.79, -0.28, ...]
• "chased" → [0.34, 0.67, ...]
• "cat" → [0.82, -0.31, ...]

But these embeddings are the same regardless of where the word appears!
The embedding for "dog" is identical whether it's at position 1 or position 4.

So how does the model know "dog" comes before "chased" vs after?

Understanding the Position-Blind Nature of Embeddings

When we convert tokens to embeddings, we lose information about word order. The model receives a collection of meaning vectors but has no indication which came first, second, or third.

Input: "The dog chased the cat"
Embeddings: [[0.12, -0.45, ...], [0.79, -0.28, ...], [0.34, 0.67, ...], [0.12, -0.45, ...], [0.82, -0.31, ...]]

Input: "The cat chased the dog"
Embeddings: [[0.12, -0.45, ...], [0.82, -0.31, ...], [0.34, 0.67, ...], [0.12, -0.45, ...], [0.79, -0.28, ...]]

Without position information, the model can't distinguish subject from object!

This poses a significant challenge for language understanding. Word order determines who did what to whom, whether something happened in past or future, and countless other critical distinctions.

How Positional Encodings Address This

We need to add position information to each token's embedding. The approach: add a position-specific vector to each embedding that encodes "this token is at position 0", "this is at position 1", etc.

Now the same word gets different representations based on where it appears:

"dog" at position 1 = embedding_dog + position_encoding_1
"dog" at position 4 = embedding_dog + position_encoding_4

Result: Different final vectors → Model knows the position!

Adding Position Information

Here's the brilliant idea: Create special "position vectors" and add them to the word embeddings!

Word Embedding
+
Position Vector
=
Final Input (has BOTH meaning AND position!)

Let's Build This Step by Step

Step 1: What We Already Have

Let's say we have the sentence: "The dog sleeps"

Position 0: "The" → Embedding: [0.1, 0.5, 0.2, 0.8, ...]
Position 1: "dog" → Embedding: [0.6, 0.3, 0.7, 0.4, ...]
Position 2: "sleeps" → Embedding: [0.4, 0.9, 0.1, 0.6, ...]

Notice: The embedding for "dog" is the same whether it's at position 1, position 5, or position 100!

Step 2: Create Position Vectors

For each position (0, 1, 2, 3, ...), we create a unique "position vector" - a special pattern of numbers:

Position 0 vector: [0.0, 1.0, 0.0, 1.0, ...]
Position 1 vector: [0.8, 0.5, 0.8, 0.6, ...]
Position 2 vector: [0.9, -0.4, 0.9, -0.3, ...]

Each position gets its own unique pattern - like a barcode for that position!

Step 3: Add Them Together!

Now we simply add the word embedding and position vector together:

For "dog" at position 1:
Word embedding: [0.6, 0.3, 0.7, 0.4, ...]
Position vector: [0.8, 0.5, 0.8, 0.6, ...]
Final input: [1.4, 0.8, 1.5, 1.0, ...]

✓ Success! This final vector now contains BOTH the meaning of "dog" AND the fact that it's at position 1!

The Key Insight

The same word at different positions gets different final vectors!
This is exactly what we need to solve the word order problem.

Sentence 1: "The dog bites the man"
Position 1: "dog" → Final input: [1.4, 0.8, 1.5, 1.0, ...]
Position 4: "man" → Final input: [0.5, 1.3, 0.2, 1.8, ...]
Sentence 2: "The man bites the dog"
Position 1: "man" → Final input: [0.9, 1.1, 0.7, 1.6, ...]
Position 4: "dog" → Final input: [1.0, 0.9, 1.2, 1.4, ...]

✓ Notice: "dog" at position 1 has different numbers than "dog" at position 4!
The AI can now tell the difference between "dog bites man" and "man bites dog"

But How Do We Create These Position Vectors?

We need a way to encode position information. There are different approaches used by different models:

❌ Simple counting (1, 2, 3...): Numbers grow unbounded, could mess up embeddings
✓ Sine and cosine patterns (Original Transformer): Add position vectors using sin/cos, stay between -1 and 1
✓ Learned vectors (BERT, GPT): Let the AI learn the best position vectors during training
✓ RoPE - Rotary Position Embedding (LLaMA, most modern LLMs): Instead of adding position info, rotate the embedding vectors by an angle determined by their position. More mathematically elegant and works better for long sequences!

Note: The "add position vector" approach we explained is conceptually simpler and used by the original Transformer. Modern models like LLaMA use rotation-based methods (RoPE) for better performance, but the core idea remains: inject position information so the model knows word order!
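
For reference, here is a simplified NumPy sketch of the original sine/cosine approach, including the final addition of word embeddings and position vectors (the sequence length, dimension, and embeddings are toy values):

import numpy as np

def sinusoidal_positions(seq_len, dim):
    """Original-Transformer-style position vectors (simplified sketch)."""
    positions = np.arange(seq_len)[:, None]              # 0, 1, 2, ...
    pair_dims = np.arange(0, dim, 2)[None, :]             # dimensions taken in pairs
    angles = positions / (10000 ** (pair_dims / dim))     # a different frequency per pair
    encoding = np.zeros((seq_len, dim))
    encoding[:, 0::2] = np.sin(angles)                    # even dimensions: sine
    encoding[:, 1::2] = np.cos(angles)                    # odd dimensions: cosine
    return encoding                                       # every value stays within [-1, 1]

pos = sinusoidal_positions(seq_len=3, dim=8)              # positions 0, 1, 2
word_embeddings = np.random.rand(3, 8)                    # "The", "dog", "sleeps" (toy vectors)
model_input = word_embeddings + pos                       # meaning + position, token by token
print(model_input.shape)                                  # (3, 8)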

Evolution of Positional Encoding Methods

Different approaches to encoding position information have emerged over time, each with tradeoffs:

  • 2017–2019: Sinusoidal, context up to ~2k tokens (mostly obsolete)
  • 2019–2021: Learned absolute, context up to ~4k tokens (limited use)
  • 2021–2023: RoPE / ALiBi, context up to 128k+ tokens (standard today)
  • 2024–2025: xPos, CARoPE, TAPE, context up to 1M+ tokens (research frontier)

Current State (2025): RoPE is the default positional strategy in most modern LLMs (LLaMA 2/3, Gemma, Mistral, Code-Llama), while ALiBi is used in MPT, Falcon, and JINA models. Both methods significantly outperform the original sinusoidal approach, especially for long context windows.

🔄 RoPE: Rotary Position Embedding (LLaMA, Mistral, Gemma)

What is RoPE?
RoPE (Rotary Position Embedding) is a method that encodes position information by rotating query and key vectors in the attention mechanism rather than adding position vectors to embeddings. It has become the most popular positional strategy for modern transformers.

How It Works

The Core Idea:

Instead of adding a position vector to the word embedding, RoPE rotates the embedding by an angle that depends on its position. Think of it like a clock hand rotating as time (position) advances.

Token at position 0: Rotate by 0°
Token at position 1: Rotate by θ°
Token at position 2: Rotate by 2θ°
Token at position n: Rotate by n×θ°
Rotation in 2D Planes:

RoPE organizes the embedding dimensions as pairs (treating each pair as a 2D coordinate). For a 768-dim embedding, that's 384 pairs. Each pair gets rotated by a position-dependent angle.

Example: Dimensions [0,1] form one 2D plane, [2,3] another, and so on. The rotation angle decreases for higher dimension pairs, creating a multi-frequency encoding.
The Math (Simplified):

For position pos and dimension pair i:

θᵢ = 10000^(-2i/d)
Rotation matrix rotates by pos × θᵢ
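
A simplified NumPy sketch of that rotation, applied to a single vector (real implementations vectorize this and apply it to the query and key vectors inside attention):

import numpy as np

def rope_rotate(vec, pos):
    """Rotate consecutive dimension pairs of `vec` by position-dependent angles."""
    d = len(vec)
    out = np.empty_like(vec, dtype=float)
    for i in range(d // 2):                        # one 2D plane per dimension pair
        theta = 10000 ** (-2 * i / d)              # lower frequencies for later pairs
        angle = pos * theta
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i]     = x * np.cos(angle) - y * np.sin(angle)
        out[2 * i + 1] = x * np.sin(angle) + y * np.cos(angle)
    return out

vec = np.random.rand(8)                # a toy 8-dimensional query/key vector
print(rope_rotate(vec, pos=0))         # position 0: rotation by 0, vector unchanged
print(rope_rotate(vec, pos=3))         # position 3: each pair rotated by 3 × θᵢ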

Why RoPE Works Better

  • Relative position encoding: The attention score between two tokens automatically depends on their relative distance, not absolute positions
  • No explicit position limit: Works for any sequence length without retraining (with proper scaling)
  • Better extrapolation: Maintains performance when given sequences longer than training length
  • Efficient: No extra parameters to learn - just rotation operations
  • Mathematically elegant: Decaying inter-token dependency with increasing distance emerges naturally

Context Extension Techniques (2024-2025)

Recent advances allow RoPE to handle even longer contexts:

NTK-Aware Scaling: Adjusts rotation frequencies to handle longer sequences (used in Code-LLaMA for 16K+ tokens)
YaRN (Yet another RoPE extensioN): Keeps angles stable for 256K+ tokens through dynamic interpolation
ComRoPE (CVPR 2025): Adds trainable parameters for improved scalability and robustness

Used in: LLaMA 2/3, Gemma, Mistral, Code-LLaMA, PaLM, GPT-NeoX, and most modern open-source LLMs

📐 ALiBi: Attention with Linear Biases (MPT, Falcon, JINA)

What is ALiBi?
ALiBi (Attention with Linear Biases) is a simpler approach that doesn't add positional embeddings at all. Instead, it directly modifies the attention scores by adding a penalty proportional to the distance between tokens.

How It Works

The Core Idea:

When computing attention between token at position i and token at position j, ALiBi adds a negative penalty based on their distance |i - j|:

attention_score = query · key - m × |i - j|

where m is a slope parameter (different for each attention head)

Concrete Example:

For a sentence "The cat sat on the mat":

Attention from "cat" (pos 1) to:
→ "The" (pos 0): penalty = -m × |1-0| = -m × 1
→ "cat" (pos 1): penalty = -m × |1-1| = 0
→ "sat" (pos 2): penalty = -m × |1-2| = -m × 1
→ "mat" (pos 5): penalty = -m × |1-5| = -m × 4

Result: Tokens far away get lower attention scores (larger penalty), while nearby tokens get higher attention.
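
A small NumPy sketch of the bias computation (the slope m and the attention scores are invented for illustration):

import numpy as np

seq_len, m = 6, 0.5                     # "The cat sat on the mat"; m is one head's slope
scores = np.random.rand(seq_len, seq_len)            # stand-in for raw query · key scores

positions = np.arange(seq_len)
distance = np.abs(positions[:, None] - positions[None, :])   # the |i - j| matrix
biased_scores = scores - m * distance                # distant tokens get penalized

print(distance[1])           # distances from "cat" (pos 1): [1 0 1 2 3 4]
print(biased_scores.shape)   # (6, 6)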

Why ALiBi Works Better

  • No position embeddings needed: Simpler architecture - word embeddings go directly into the model
  • Excellent extrapolation: Models trained on 1K tokens can handle 2K+ tokens at inference with no degradation
  • Training efficiency: 11% faster training and 11% less memory than sinusoidal methods
  • Length independence: Works for any sequence length without special techniques
  • Computational efficiency: Simple subtraction operation, hardware-friendly

ALiBi vs RoPE: Key Differences

ALiBi advantages: Simpler implementation, faster training, better computational efficiency
RoPE advantages: More widely adopted, mathematically elegant, proven at massive scale

Both methods achieve similar performance and far exceed older approaches like sinusoidal encoding.

Used in: MPT (MosaicML), Falcon (TII), JINA embeddings, and various research models

Why This Matters

In real-world applications, positional encodings enable understanding of:

Temporal Order
"First I called, then I emailed, finally I gave up"

The AI understands this is a sequence of events in time order.

Dependency Structure
"The product that I ordered last week arrived damaged"

"that I ordered last week" modifies "product" - position helps track these relationships.

Negation Scope
"Not satisfied with the resolution"

The word "Not" at the beginning flips the sentiment of the entire phrase.

Question vs Statement
"Can I get a refund?" vs "I can get a refund"

Word order distinguishes questions from statements - critical for routing!

Bottom Line: Positional encodings are the secret ingredient that allows transformers to understand that language is not just a bag of words - it's a sequence where order carries meaning.

Key Takeaway

Positional encodings capture word order information in transformer models.
Without position information, transformers cannot distinguish "dog bites man" from "man bites dog."
Most modern language models use RoPE (Rotary Position Embedding) for improved long-context understanding.

Training & Applications: From Learning to Deployment

How Do Embeddings Learn Meaning?

We've seen what embeddings are and how they're used. But where do these magical numbers come from? How does the model learn that "cat" and "dog" should have similar embeddings?

The Core Learning Principle

"You shall know a word by the company it keeps" - J.R. Firth (1957)

Words that appear in similar contexts tend to have similar meanings. This is called the distributional hypothesis, and it's the foundation of embedding learning.

Learning from Context

"The cat is sleeping on the couch"
"The dog is sleeping on the couch"
"The cat chased a mouse"
"The dog chased a ball"

"cat" and "dog" appear with similar surrounding words: "The ___ is sleeping", "The ___ chased". The model learns to give them similar embeddings because they appear in similar contexts.

The Training Mechanism: How Embeddings Actually Change

Understanding the principle is one thing — but how does the model actually adjust those 768 numbers? Here's the core mechanism.

Step 1: Embeddings Start Random

Before training, each word gets a random vector of numbers. "cat" = [0.42, -0.13, 0.88, ...], "dog" = [-0.07, 0.91, -0.24, ...]. These numbers mean nothing yet — they're just starting points.

Step 2: Neural Networks Transform Vectors

Neural networks are fundamentally vector transformation machines. They take input vectors, transform them through multiple layers, and produce output vectors.

Input embedding (vector)
  ↓ × weight matrix
Hidden layer 1 (new vector)
  ↓ × weight matrix
Hidden layer 2 (new vector)
  ↓ × weight matrix
Output prediction (vector)

Each layer multiplies the vector by a weight matrix, producing a new vector. The final vector represents the model's prediction. For next-token prediction, this output vector contains probabilities for each possible next word.

Step 3: The Learning Loop

1. Make a Prediction
Input: "The cat is sleeping on the ___"
Model predicts: "couch" (80% confident)
2. Measure the Error
Actual next word: "couch" ✓
Error = small (prediction was correct!)
3. Backpropagation
Calculate: "How should we change the weights (including embeddings) to reduce this error?"
The math works backward through all layers, computing gradients.
4. Update the Embeddings
Adjust the embedding values slightly:
"cat" embedding: [0.42, -0.13, 0.88, ...] → [0.43, -0.12, 0.89, ...]
Tiny changes, but repeated millions of times across billions of examples.

The Result

After training on billions of sentences, words that appear in similar contexts (like "cat" and "dog") naturally end up with similar embedding vectors — not because we told the model they're similar, but because the training process pushed them together to minimize prediction errors.
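
Here is a deliberately tiny PyTorch sketch of that loop: a four-word vocabulary, one training pair, and an embedding table updated by gradient descent (everything here is invented for illustration):

import torch
import torch.nn as nn

vocab = {"the": 0, "cat": 1, "dog": 2, "sleeps": 3}
embedding = nn.Embedding(len(vocab), 8)            # 8-dim embeddings, random at first
predictor = nn.Linear(8, len(vocab))               # maps a vector to next-word scores
optimizer = torch.optim.SGD(
    list(embedding.parameters()) + list(predictor.parameters()), lr=0.1
)

context = torch.tensor([vocab["cat"]])             # the input word
target = torch.tensor([vocab["sleeps"]])           # the word that actually came next

for step in range(100):
    logits = predictor(embedding(context))                 # 1. make a prediction
    loss = nn.functional.cross_entropy(logits, target)     # 2. measure the error
    optimizer.zero_grad()
    loss.backward()                                        # 3. backpropagation (gradients)
    optimizer.step()                                       # 4. update embeddings + weights

print(loss.item())   # the loss shrinks as the "cat" embedding adjusts, step by step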

Three Ways to Set Up the Learning Task

The mechanism above works for any prediction task. Different training approaches just change what the model is asked to predict:

1. Next Token Prediction: Given "The customer wants a ___", predict "refund". (Used by GPT, LLaMA, Claude)
2. Masked Language Modeling: Given "The customer wants a [MASK]", predict the masked word using context before and after. (Used by BERT, RoBERTa)
3. Contrastive Learning: Train so that similar pairs ("Product defective" ↔ "Item broken") end up close together while different pairs end up far apart. (Used by Sentence-BERT, text-embedding-3)

Want to understand training in depth?
Chapter 5: Fine-tuning & Model Adaptation covers:

  • How gradient descent optimization works
  • Learning rates, batch sizes, and convergence
  • Fine-tuning pretrained embeddings for specific domains
  • When to freeze vs. update embedding layers
  • Loss functions and evaluation metrics

Real-World Applications

See embeddings in action through a practical semantic search example, then explore how production AI systems like RAG and AI agents use this technology.

Example: Semantic Search in Action

Keyword Search Limitations
Customer asks: "My screen won't turn on after I press the power button"
Keyword search for "screen" + "turn on" + "power button":
• Returns 87 articles
• Includes irrelevant results like "screen reader" and "keyboard shortcuts"
• Right answer buried at position #47
Agent spends 3-5 minutes searching
How Semantic Search Captures Meaning
AI converts question to embedding, finds similar articles in under 1 second:
Top 5 results (all relevant!):
  1. "Display not responding to power" (0.94) ← Perfect match!
  2. "Monitor won't start up" (0.91)
  3. "Black screen after powering on" (0.88)
  4. "No display when pressing power button" (0.85)
  5. "Screen stays dark on boot" (0.82)
Different words, same meaning — embeddings capture semantic similarity!
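
A minimal sketch of this search pattern in Python, assuming the open-source sentence-transformers package and its all-MiniLM-L6-v2 model (neither is required by this chapter; any embedding provider works the same way):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")     # a small open-source embedding model

articles = [
    "Display not responding to power",
    "Monitor won't start up",
    "Black screen after powering on",
    "How to use keyboard shortcuts",
]
article_vecs = model.encode(articles)               # computed once and stored

query = "My screen won't turn on after I press the power button"
query_vec = model.encode(query)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank articles by cosine similarity to the query, most similar first.
scores = sorted(((cosine(query_vec, v), title)
                 for v, title in zip(article_vecs, articles)), reverse=True)
for score, title in scores:
    print(round(score, 2), title)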

How Production AI Systems Use Embeddings

RAG: Retrieval-Augmented Generation
LLMs don't know your company's private data. RAG uses embeddings to find relevant documents, then sends them to the LLM as context.
The basic flow:
1. Convert all docs → embeddings → store in vector database
2. User asks question → convert to embedding
3. Find top 5 most similar docs (cosine similarity)
4. Send question + retrieved docs → LLM generates grounded answer
→ Chapter 7: RAG (Retrieval-Augmented Generation)
Covers chunking strategies, vector database selection, reranking, hybrid search, handling citations, evaluating retrieval quality, and production deployment patterns
AI Agents: Tool Selection
Modern AI agents have 100s of tools/APIs. How do they pick the right one? Embeddings.
The basic flow:
1. Each tool description → embedding
2. User request → embedding
3. Find top 5 most relevant tools (cosine similarity)
4. Agent picks best tool from those 5 (not all 500!)
→ Chapter 8: AI Agents & Tool Use
Covers agent architectures (ReAct, Plan-and-Execute), tool calling protocols, multi-step reasoning, memory management, error recovery, and building production agents
Other Common Applications
Recommendations: "Customers who liked X also liked Y" — embeddings cluster similar products
Duplicate Detection: Find tickets/documents with same meaning but different words
Classification: Route items to right team by embedding similarity
Clustering: Auto-discover topics in customer feedback without manual tagging
The Core Pattern:
Convert items (docs, tools, products) → embeddings → find similar items using cosine similarity
Every modern AI deployment — RAG, agents, recommendations, search — uses this one pattern.

What's Next

What's Being Solved Right Now

You've seen how embeddings work in production—semantic search, RAG, AI agents. Now see what's improving behind the scenes: the same math you learned (tokenization, vector spaces, cosine similarity), just scaled to solve global business problems.

Understanding these challenges helps you recognize when existing solutions might struggle—and what emerging technologies might solve next.

Cost Inequality Across Languages

Impact: API costs, regional fairness

BPE and WordPiece tokenizers split Hindi/Bengali text into 2-5× more tokens than English for the same content. This means Indian companies pay up to 5× more for API calls—same question, higher cost. A customer support system serving 10M users in India pays 3-5× more for identical semantic retrieval compared to English markets.

What's being built: Researchers are developing Universal Tokenizers with balanced vocabularies (250K+ tokens) that aim to reduce this cost inequality across languages.

Retrieval Accuracy at Billion-Document Scale

Impact: Enterprise search quality, knowledge base size limits

Google DeepMind discovered that single-vector embeddings start losing accuracy after a certain document count: 512-dim fails around 500K documents, 1024-dim around 4M, 4096-dim around 250M. This isn't a bug—it's a mathematical ceiling.

What's being built: Multi-vector retrieval systems like ColBERT use multiple 128-dim vectors per document instead of one 1024-dim vector, preserving finer semantic detail at scale.

Preserving Context Across Chunk Boundaries

Impact: Long-document retrieval accuracy, legal/financial document search

Traditional APIs return one static vector per chunk. When long documents split into 512-token chunks, critical context spanning boundaries gets lost. Example: "The company announced layoffs in Q3, but profits rose in Q4" might split across chunks, losing the causal relationship between events.

What's being built: Late Chunking techniques embed all tokens first (up to 32K), then chunk afterward—preserving cross-boundary context. Voyage-context-3 achieves 14% better retrieval accuracy using this approach, particularly beneficial for long legal and financial documents.

Multimodal Search & Edge Deployment

Impact: Visual search, IoT/edge AI deployments

Modern AI needs to search across images, videos, audio, and text jointly. How do you create a vector space where "a photo of a cat" and the text "cat" cluster together? Models like jina-clip-v2 train text and image encoders together, creating a shared 512-dim space—same cosine similarity math, now spanning modalities.

What's being built: For edge devices (Raspberry Pi: 8GB RAM), Matryoshka Embeddings generate multiple resolutions (1024 → 128 dims) from one model—use lower dimensions on IoT, full precision in cloud. This matters for on-device search without cloud latency.

COMING NEXT

Chapter 2: Non-Linearity

Embeddings convert text to vectors and measure similarity—but that's still just linear math. To recognize patterns, make predictions, and truly "learn," AI needs something more.

Learn why stacking layers matters and how activation functions transform simple calculations into intelligent systems.

Test Your Embedding Knowledge

Ready to test your understanding? Answer all questions correctly to unlock your achievement!

1. What is an embedding in the context of neural networks?

2. What does cosine similarity measure?

3. Why is BPE (Byte-Pair Encoding) preferred over word-level tokenization?

4. How does one-hot encoding × embedding matrix work mathematically?

5. Why do we need positional encodings in transformers?

6. What is the key advantage of learned embeddings over one-hot encoding?

7. How do embeddings learn that "cat" and "dog" should be similar?