Chapter 9
Coming Soon

Modern LLM Architectures (2025)

From test-time compute to mixture-of-experts: the latest breakthroughs in AI

The Evolution of LLMs

2017

Transformers Introduced

"Attention Is All You Need" paper revolutionizes NLP

2018-2019

BERT & GPT-2

Pre-training + fine-tuning becomes dominant paradigm

2020-2023

Scale Era

GPT-3, GPT-4, Claude, LLaMA: bigger models, better performance

2024-2025

New Paradigm

Test-time compute, MoE, reasoning models, efficiency breakthroughs

What You'll Learn

The AI landscape has shifted dramatically in 2024-2025. This chapter covers the latest architectural innovations that define modern LLMs:

01

Test-Time Compute & Reasoning Models

The 2024-2025 breakthrough: models that "think longer" perform better

  • OpenAI o1 & o3: Chain-of-thought at inference time
  • Why spending more compute during generation helps
  • o3's 88% on ARC-AGI (vs 32% for o1)
  • Trade-offs: accuracy vs cost vs latency
  • The shift from pre-training scaling to inference scaling

Key Insight: Sometimes it's better to let a smaller model "think longer" than to make the model bigger (see the sketch below).
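
Here is that sketch: best-of-N sampling, one of the simplest ways to spend more compute at inference time. The generate and score functions below are hypothetical stand-ins for a model call and a verifier, not any real API; the point is only that accuracy can be bought with extra samples during generation.

import random

def generate(prompt: str) -> str:
    """Stand-in for one sampled model response (a reasoning chain plus a final answer)."""
    return f"candidate answer #{random.randint(0, 999)} for: {prompt}"

def score(prompt: str, answer: str) -> float:
    """Stand-in for a verifier or reward model that rates a candidate answer."""
    return random.random()

def best_of_n(prompt: str, n: int) -> str:
    """Spend n times the inference compute: sample n candidates, keep the best-scoring one.
    Larger n usually buys accuracy, at the cost of latency and dollars."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda answer: score(prompt, answer))

print(best_of_n("Prove that the square root of 2 is irrational.", n=8))

Models like o1 and o3 push the same idea further by training the model to produce long internal chains of thought, so the extra compute is spent inside a single, longer generation rather than across parallel samples.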

02

Mixture-of-Experts (MoE)

Activate only the "experts" you need for each token

  • Dense vs MoE architectures
  • How routing works: which expert for which input?
  • 400B parameters, only 50B active per token
  • Examples: Mixtral, Gemini, DeepSeek-V3, Llama 4
  • Trade-offs: complexity vs efficiency

Dense (GPT-4, Claude)

All parameters active for every token

MoE (Gemini, Llama 4)

Only relevant experts activated per token
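
The routing idea above can be made concrete with a small NumPy sketch of top-2 routing. This is a toy layer with made-up sizes, not any production implementation: a learned gate scores every expert for each token, only the two highest-scoring experts run, and their outputs are mixed with the renormalized gate weights.

import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

# One tiny "expert" per slot; real experts are full feed-forward blocks.
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.02   # the learned gating network

def moe_layer(x: np.ndarray) -> np.ndarray:
    """x: (n_tokens, d_model). Each token is processed by only top_k of the n_experts."""
    logits = x @ router                                      # (n_tokens, n_experts)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)               # softmax over experts
    out = np.zeros_like(x)
    for t, token in enumerate(x):
        top = np.argsort(probs[t])[-top_k:]                  # the top_k experts for this token
        weights = probs[t, top] / probs[t, top].sum()        # renormalize their gate scores
        for w, e in zip(weights, top):
            out[t] += w * (token @ experts[e])               # only these experts do any work
    return out

tokens = rng.standard_normal((4, d_model))
print(moe_layer(tokens).shape)   # (4, 16): same output shape as a dense layer, a fraction of the compute

This is why a model can advertise hundreds of billions of total parameters while each token only pays for the few experts it is routed to.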

03

Context Windows: 128K to 10M Tokens

From processing paragraphs to processing entire codebases

  • Evolution: 2K → 8K → 100K → 10M tokens
  • Technical challenges: attention complexity is O(n²)
  • Solutions: Flash Attention, RoPE scaling, sparse attention
  • Real applications: analyzing entire books, repositories
  • Llama 4 Scout: 10M token context window
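
A quick back-of-the-envelope calculation shows why the O(n²) attention matrix is the central obstacle here. The assumptions below (16-bit scores, 32 heads) are illustrative, not tied to any particular model.

bytes_per_score = 2      # fp16/bf16 attention scores
n_heads = 32             # illustrative head count

for n_tokens in (2_000, 8_000, 128_000, 10_000_000):
    matrix_bytes = n_tokens ** 2 * bytes_per_score * n_heads
    print(f"{n_tokens:>12,} tokens -> {matrix_bytes / 1e9:,.1f} GB of attention scores")

At 128K tokens the naive score matrix is already around a terabyte, and at 10M tokens it would be petabytes, which is why techniques like Flash Attention compute attention in tiles without ever materializing the full matrix.
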
04

Multimodal Models

Beyond text: vision, audio, and unified understanding

  • Architecture: Vision encoder + LLM + connector
  • GPT-4V, Claude 3.5 Sonnet, Gemini 2.0
  • How images become embeddings
  • Cross-modal attention: text attending to images
  • Future: video, audio, 3D models
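
The "vision encoder + LLM + connector" recipe above can be sketched in a few lines of NumPy. Everything here is a toy stand-in with made-up dimensions: the image is cut into patches, each patch gets an embedding from a placeholder encoder, a linear connector projects those embeddings into the LLM's token space, and the image tokens are simply concatenated with the text tokens.

import numpy as np

rng = np.random.default_rng(0)
d_vision, d_model = 64, 128                        # illustrative embedding sizes

def vision_encoder(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Placeholder ViT-style encoder: split the image into patches and embed each one."""
    h, w, c = image.shape
    patches = [image[i:i + patch, j:j + patch].reshape(-1)
               for i in range(0, h, patch) for j in range(0, w, patch)]
    embed = rng.standard_normal((patch * patch * c, d_vision)) * 0.02
    return np.stack(patches) @ embed               # (n_patches, d_vision)

connector = rng.standard_normal((d_vision, d_model)) * 0.02   # projects vision space -> LLM space

image = rng.random((224, 224, 3))                  # a fake 224x224 RGB image
text_tokens = rng.standard_normal((12, d_model))   # embeddings of a short text prompt

image_tokens = vision_encoder(image) @ connector   # (196, d_model): image patches become "tokens"
llm_input = np.concatenate([image_tokens, text_tokens])
print(llm_input.shape)                             # (208, 128): one sequence the LLM attends over

Once image patches live in the same embedding space as text tokens, ordinary self-attention gives you cross-modal attention for free: a text token can attend to an image token exactly as it would to another word.
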
05

Chain-of-Thought & Self-Reflection

Teaching models to reason step-by-step and check their work

  • What is chain-of-thought reasoning?
  • Prompting vs trained CoT (o1 models)
  • Self-consistency: generating multiple solutions
  • Backtracking and error correction
  • Limitations: brittleness outside training distribution
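
Self-consistency from the list above is simple to express in code: sample several independent reasoning chains, keep only each chain's final answer, and return the most common one. The sample_chain_of_thought function below is a hypothetical stand-in that simulates a solver which reasons correctly about 70% of the time.

import random
from collections import Counter

def sample_chain_of_thought(question: str) -> str:
    """Stand-in for one sampled reasoning chain; returns only its final answer.
    Simulates a noisy solver that is right ~70% of the time."""
    return "408" if random.random() < 0.7 else random.choice(["398", "418"])

def self_consistency(question: str, n_samples: int = 15) -> str:
    """Majority vote over independently sampled reasoning chains."""
    answers = [sample_chain_of_thought(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("What is 17 * 24?"))   # almost always "408", even though single samples are often wrong

A 70%-accurate sampler voted 15 times is right far more often than any single sample, which is the whole trade: more samples, more reliability, more cost.
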
06

Training Optimization: From SGD to Adam

How modern LLMs learn: the algorithms that made billion-parameter models trainable

  • SGD: The basic algorithm (builds on Chapter 1's gradient descent)
  • Momentum: Accelerating convergence by "remembering" previous updates
  • Adam: The optimizer that powers most LLMs (adaptive learning rates)
  • Learning rate schedules: Warmup + cosine decay
  • Batch vs Mini-Batch: Why we don't train on all data at once
  • Why ChatGPT trains in weeks, not years

SGD: w = w - lr × gradient

Simple but slow to converge

Adam: adapts the learning rate per parameter

Fast, stable, industry standard
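
To connect these update rules back to Chapter 1's gradient descent, here is a minimal NumPy comparison of plain SGD, SGD with momentum, and Adam on a toy quadratic loss. The hyperparameters are common textbook defaults, not the settings of any real LLM run.

import numpy as np

def loss_and_grad(w):
    """Toy loss: L(w) = 0.5 * ||w - target||^2, whose gradient is simply (w - target)."""
    target = np.array([3.0, -2.0])
    return 0.5 * np.sum((w - target) ** 2), w - target

def train(update, steps=200):
    w, state = np.zeros(2), {}
    for t in range(1, steps + 1):
        loss, g = loss_and_grad(w)
        w = update(w, g, t, state)
    return loss

def sgd(w, g, t, state, lr=0.1):
    return w - lr * g                                   # w = w - lr * gradient

def momentum(w, g, t, state, lr=0.1, beta=0.9):
    state["v"] = beta * state.get("v", 0.0) + g         # "remember" previous updates
    return w - lr * state["v"]

def adam(w, g, t, state, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    state["m"] = b1 * state.get("m", 0.0) + (1 - b1) * g        # running mean of gradients
    state["s"] = b2 * state.get("s", 0.0) + (1 - b2) * g ** 2   # running mean of squared gradients
    m_hat = state["m"] / (1 - b1 ** t)                          # bias correction for the first steps
    s_hat = state["s"] / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(s_hat) + eps)              # per-parameter adaptive step size

# All three drive this easy toy loss to essentially zero; Adam's real advantage shows up
# on the noisy, badly scaled, billion-parameter losses of actual LLM training.
for name, update in [("SGD", sgd), ("Momentum", momentum), ("Adam", adam)]:
    print(f"{name:>8}: final loss = {train(update):.2e}")

A real training run would add a learning-rate schedule (linear warmup followed by cosine decay) and mini-batches of data on top of these update rules.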

07

Inference Efficiency & Deployment

Running powerful models faster and cheaper in production

  • Quantization: FP16 → INT8 → INT4 precision
  • Distillation: Teaching small models from large ones
  • Pruning: Removing unnecessary parameters
  • KV caching: Speeding up generation by 10x
  • Flash Attention: Making attention computation faster
  • Real-world impact: running 70B models on consumer GPUs
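
As one concrete example of these techniques, the snippet below applies symmetric per-tensor INT8 quantization to a random stand-in weight matrix: store a single floating-point scale plus 8-bit integers, and dequantize on the fly at inference. Production schemes (per-channel or per-group scales, INT4, activation-aware methods) are more refined, but the memory arithmetic is the same.

import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal((4096, 4096)).astype(np.float32)   # a fake FP32 weight matrix

# Symmetric per-tensor INT8 quantization: one scale + int8 values.
scale = np.abs(weights).max() / 127.0
q_weights = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
dequantized = q_weights.astype(np.float32) * scale               # what inference actually uses

print(f"memory: {weights.nbytes / 1e6:.0f} MB (FP32) -> {q_weights.nbytes / 1e6:.0f} MB (INT8)")
print(f"mean absolute rounding error: {np.abs(weights - dequantized).mean():.5f}")

Scaled up, the same idea is what lets a 70B-parameter model shrink from roughly 140 GB of FP16 weights toward ~35 GB at INT4, small enough to run on consumer hardware.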

2025 Model Landscape

GPT-4 (OpenAI)
Architecture: Dense / Router-based
Context: 128K tokens
Strengths: General reasoning, coding

Claude 3.5 Sonnet (Anthropic)
Architecture: Dense
Context: 200K tokens
Strengths: Analysis, safety, coding

Llama 4 (Meta)
Architecture: MoE
Context: 128K - 10M tokens
Strengths: Open source, multimodal

Gemini 2.0 (Google)
Architecture: MoE
Context: 1M+ tokens
Strengths: Multimodal, long context

o1 / o3 (OpenAI)
Architecture: Reasoning-focused
Context: 128K tokens
Strengths: Math, science, hard reasoning
Breakthrough 2024-2025

DeepSeek-V3 (DeepSeek)
Architecture: MoE
Context: 128K tokens
Strengths: Efficiency, cost-effective

The 2024-2025 Paradigm Shift

Old Paradigm (2020-2023)

Better AI = Bigger Model + More Data
  • Pre-training scaling was king
  • Dense architectures dominated
  • Context windows: 2K-8K tokens
  • Single-modality (text only)
  • Fast inference, simple generation

New Paradigm (2024-2025)

Better AI = Smart Architecture + Test-Time Compute + Efficiency
  • Test-time compute scaling matters
  • MoE for efficiency at scale
  • Context windows: 128K-10M tokens
  • Multimodal by default
  • Reasoning-focused generation

The Key Insight

We've learned that how you use compute matters as much as how much compute you have. A smaller model that "thinks longer" can outperform a larger model that answers immediately. This realization has fundamentally changed how we build and deploy LLMs.

Looking Ahead

🔬

Mechanistic Interpretability

Understanding what's happening inside the billions of parameters. Anthropic's goal: detect most AI problems by 2027.

🎯

Specialized Models

Medical LLMs, legal LLMs, coding-specific models. Fine-tuned architectures for specific domains.

📱

Edge Deployment

Running powerful models on phones and laptops. Quantization and distillation enabling local AI.

🤝

Agent Systems

LLMs that use tools, browse the web, write and execute code. Multi-agent collaboration.

In Development

This chapter will provide a comprehensive tour of 2025's most important LLM innovations. We'll explain not just what these techniques are, but why they matter and how they work—building on everything you've learned in previous chapters.

From the mathematical foundations of MoE routing to the engineering breakthroughs enabling 10M token contexts, you'll gain a deep understanding of modern AI systems.
