Chapter 9
Coming Soon

Modern LLM Architectures (2025)

From test-time compute to mixture-of-experts: the latest breakthroughs in AI

The Evolution of LLMs

2017

Transformers Introduced

"Attention Is All You Need" paper revolutionizes NLP

2018-2019

BERT & GPT-2

Pre-training + fine-tuning becomes dominant paradigm

2020-2023

Scale Era

GPT-3, GPT-4, Claude, LLaMA: bigger models, better performance

2024-2025

New Paradigm

Test-time compute, MoE, reasoning models, efficiency breakthroughs

What You'll Learn

The AI landscape has shifted dramatically in 2024-2025. This chapter covers the latest architectural innovations that define modern LLMs:

01

Test-Time Compute & Reasoning Models

The 2024-2025 breakthrough: models that "think longer" perform better

  • OpenAI o1 & o3: Chain-of-thought at inference time
  • Why spending more compute during generation helps
  • o3's 88% on ARC-AGI (vs 32% for o1)
  • Trade-offs: accuracy vs cost vs latency
  • The shift from pre-training scaling to inference scaling

Key Insight: Sometimes it's better to let a smaller model "think longer" than to make the model bigger (see the sketch below).
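
Here is that sketch: best-of-N sampling, one of the simplest ways to spend more compute at inference time. The generate and score functions below are hypothetical stand-ins for a model call and a verifier, not any real API; the point is only that accuracy can be bought with extra samples during generation.

import random

def generate(prompt: str) -> str:
    """Stand-in for one sampled model response (a reasoning chain plus a final answer)."""
    return f"candidate answer #{random.randint(0, 999)} for: {prompt}"

def score(prompt: str, answer: str) -> float:
    """Stand-in for a verifier or reward model that rates a candidate answer."""
    return random.random()

def best_of_n(prompt: str, n: int) -> str:
    """Spend n times the inference compute: sample n candidates, keep the best-scoring one.
    Larger n usually buys accuracy, at the cost of latency and dollars."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda answer: score(prompt, answer))

print(best_of_n("Prove that the square root of 2 is irrational.", n=8))

Models like o1 and o3 push the same idea further by training the model to produce long internal chains of thought, so the extra compute is spent inside a single, longer generation rather than across parallel samples.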

02

Mixture-of-Experts (MoE)

Activate only the "experts" you need for each token

  • Dense vs MoE architectures
  • How routing works: which expert for which input?
  • 400B parameters, only 50B active per token
  • Examples: Mixtral, Gemini, DeepSeek-V3, Llama 4
  • Trade-offs: complexity vs efficiency

Dense (GPT-4, Claude)

All parameters active for every token

MoE (Gemini, Llama 4)

Only relevant experts activated per token
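
The routing idea above can be made concrete with a small NumPy sketch of top-2 routing. This is a toy layer with made-up sizes, not any production implementation: a learned gate scores every expert for each token, only the two highest-scoring experts run, and their outputs are mixed with the renormalized gate weights.

import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

# One tiny "expert" per slot; real experts are full feed-forward blocks.
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.02   # the learned gating network

def moe_layer(x: np.ndarray) -> np.ndarray:
    """x: (n_tokens, d_model). Each token is processed by only top_k of the n_experts."""
    logits = x @ router                                      # (n_tokens, n_experts)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)               # softmax over experts
    out = np.zeros_like(x)
    for t, token in enumerate(x):
        top = np.argsort(probs[t])[-top_k:]                  # the top_k experts for this token
        weights = probs[t, top] / probs[t, top].sum()        # renormalize their gate scores
        for w, e in zip(weights, top):
            out[t] += w * (token @ experts[e])               # only these experts do any work
    return out

tokens = rng.standard_normal((4, d_model))
print(moe_layer(tokens).shape)   # (4, 16): same output shape as a dense layer, a fraction of the compute

This is why a model can advertise hundreds of billions of total parameters while each token only pays for the few experts it is routed to.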

03

Context Windows: 128K to 10M Tokens

From processing paragraphs to processing entire codebases

  • Evolution: 2K → 8K → 100K → 10M tokens
  • Technical challenges: attention complexity is O(n²)
  • Solutions: Flash Attention, RoPE scaling, sparse attention
  • Real applications: analyzing entire books, repositories
  • Llama 4 Scout: 10M token context window
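
A quick back-of-the-envelope calculation shows why the O(n²) attention matrix is the central obstacle here. The assumptions below (16-bit scores, 32 heads) are illustrative, not tied to any particular model.

bytes_per_score = 2      # fp16/bf16 attention scores
n_heads = 32             # illustrative head count

for n_tokens in (2_000, 8_000, 128_000, 10_000_000):
    matrix_bytes = n_tokens ** 2 * bytes_per_score * n_heads
    print(f"{n_tokens:>12,} tokens -> {matrix_bytes / 1e9:,.1f} GB of attention scores")

At 128K tokens the naive score matrix is already around a terabyte, and at 10M tokens it would be petabytes, which is why techniques like Flash Attention compute attention in tiles without ever materializing the full matrix.
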
04

Multimodal Models

Beyond text: vision, audio, and unified understanding

  • Architecture: Vision encoder + LLM + connector
  • GPT-4V, Claude 3.5 Sonnet, Gemini 2.0
  • How images become embeddings
  • Cross-modal attention: text attending to images
  • Future: video, audio, 3D models
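
The "vision encoder + LLM + connector" recipe above can be sketched in a few lines of NumPy. Everything here is a toy stand-in with made-up dimensions: the image is cut into patches, each patch gets an embedding from a placeholder encoder, a linear connector projects those embeddings into the LLM's token space, and the image tokens are simply concatenated with the text tokens.

import numpy as np

rng = np.random.default_rng(0)
d_vision, d_model = 64, 128                        # illustrative embedding sizes

def vision_encoder(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Placeholder ViT-style encoder: split the image into patches and embed each one."""
    h, w, c = image.shape
    patches = [image[i:i + patch, j:j + patch].reshape(-1)
               for i in range(0, h, patch) for j in range(0, w, patch)]
    embed = rng.standard_normal((patch * patch * c, d_vision)) * 0.02
    return np.stack(patches) @ embed               # (n_patches, d_vision)

connector = rng.standard_normal((d_vision, d_model)) * 0.02   # projects vision space -> LLM space

image = rng.random((224, 224, 3))                  # a fake 224x224 RGB image
text_tokens = rng.standard_normal((12, d_model))   # embeddings of a short text prompt

image_tokens = vision_encoder(image) @ connector   # (196, d_model): image patches become "tokens"
llm_input = np.concatenate([image_tokens, text_tokens])
print(llm_input.shape)                             # (208, 128): one sequence the LLM attends over

Once image patches live in the same embedding space as text tokens, ordinary self-attention gives you cross-modal attention for free: a text token can attend to an image token exactly as it would to another word.
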
05

Chain-of-Thought & Self-Reflection

Teaching models to reason step-by-step and check their work

  • What is chain-of-thought reasoning?
  • Prompting vs trained CoT (o1 models)
  • Self-consistency: generating multiple solutions
  • Backtracking and error correction
  • Limitations: brittleness outside training distribution
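
Self-consistency from the list above is simple to express in code: sample several independent reasoning chains, keep only each chain's final answer, and return the most common one. The sample_chain_of_thought function below is a hypothetical stand-in that simulates a solver which reasons correctly about 70% of the time.

import random
from collections import Counter

def sample_chain_of_thought(question: str) -> str:
    """Stand-in for one sampled reasoning chain; returns only its final answer.
    Simulates a noisy solver that is right ~70% of the time."""
    return "408" if random.random() < 0.7 else random.choice(["398", "418"])

def self_consistency(question: str, n_samples: int = 15) -> str:
    """Majority vote over independently sampled reasoning chains."""
    answers = [sample_chain_of_thought(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("What is 17 * 24?"))   # almost always "408", even though single samples are often wrong

A 70%-accurate sampler voted 15 times is right far more often than any single sample, which is the whole trade: more samples, more reliability, more cost.
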
06

Training Optimization: From SGD to Adam

How modern LLMs learn: the algorithms that made billion-parameter models trainable

  • SGD: The basic algorithm (builds on Chapter 1's gradient descent)
  • Momentum: Accelerating convergence by "remembering" previous updates
  • Adam: The optimizer that powers most LLMs (adaptive learning rates)
  • Learning rate schedules: Warmup + cosine decay
  • Batch vs Mini-Batch: Why we don't train on all data at once
  • Why ChatGPT trains in weeks, not years

SGD: w = w - lr × gradient

Simple but slow to converge

Adam: adapts the learning rate per parameter

Fast, stable, industry standard
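
To connect these update rules back to Chapter 1's gradient descent, here is a minimal NumPy comparison of plain SGD, SGD with momentum, and Adam on a toy quadratic loss. The hyperparameters are common textbook defaults, not the settings of any real LLM run.

import numpy as np

def loss_and_grad(w):
    """Toy loss: L(w) = 0.5 * ||w - target||^2, whose gradient is simply (w - target)."""
    target = np.array([3.0, -2.0])
    return 0.5 * np.sum((w - target) ** 2), w - target

def train(update, steps=200):
    w, state = np.zeros(2), {}
    for t in range(1, steps + 1):
        loss, g = loss_and_grad(w)
        w = update(w, g, t, state)
    return loss

def sgd(w, g, t, state, lr=0.1):
    return w - lr * g                                   # w = w - lr * gradient

def momentum(w, g, t, state, lr=0.1, beta=0.9):
    state["v"] = beta * state.get("v", 0.0) + g         # "remember" previous updates
    return w - lr * state["v"]

def adam(w, g, t, state, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    state["m"] = b1 * state.get("m", 0.0) + (1 - b1) * g        # running mean of gradients
    state["s"] = b2 * state.get("s", 0.0) + (1 - b2) * g ** 2   # running mean of squared gradients
    m_hat = state["m"] / (1 - b1 ** t)                          # bias correction for the first steps
    s_hat = state["s"] / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(s_hat) + eps)              # per-parameter adaptive step size

# All three drive this easy toy loss to essentially zero; Adam's real advantage shows up
# on the noisy, badly scaled, billion-parameter losses of actual LLM training.
for name, update in [("SGD", sgd), ("Momentum", momentum), ("Adam", adam)]:
    print(f"{name:>8}: final loss = {train(update):.2e}")

A real training run would add a learning-rate schedule (linear warmup followed by cosine decay) and mini-batches of data on top of these update rules.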

07

Inference Efficiency & Deployment

Running powerful models faster and cheaper in production

  • Quantization: FP16 → INT8 → INT4 precision
  • Distillation: Teaching small models from large ones
  • Pruning: Removing unnecessary parameters
  • KV caching: Speeding up generation by 10x
  • Flash Attention: Making attention computation faster
  • Real-world impact: running 70B models on consumer GPUs
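
As one concrete example of these techniques, the snippet below applies symmetric per-tensor INT8 quantization to a random stand-in weight matrix: store a single floating-point scale plus 8-bit integers, and dequantize on the fly at inference. Production schemes (per-channel or per-group scales, INT4, activation-aware methods) are more refined, but the memory arithmetic is the same.

import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal((4096, 4096)).astype(np.float32)   # a fake FP32 weight matrix

# Symmetric per-tensor INT8 quantization: one scale + int8 values.
scale = np.abs(weights).max() / 127.0
q_weights = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
dequantized = q_weights.astype(np.float32) * scale               # what inference actually uses

print(f"memory: {weights.nbytes / 1e6:.0f} MB (FP32) -> {q_weights.nbytes / 1e6:.0f} MB (INT8)")
print(f"mean absolute rounding error: {np.abs(weights - dequantized).mean():.5f}")

Scaled up, the same idea is what lets a 70B-parameter model shrink from roughly 140 GB of FP16 weights toward ~35 GB at INT4, small enough to run on consumer hardware.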

2025 Model Landscape

GPT-4 (OpenAI)
Architecture: Dense / Router-based
Context: 128K tokens
Strengths: General reasoning, coding

Claude 3.5 Sonnet (Anthropic)
Architecture: Dense
Context: 200K tokens
Strengths: Analysis, safety, coding

Llama 4 (Meta)
Architecture: MoE
Context: 128K - 10M tokens
Strengths: Open source, multimodal

Gemini 2.0 (Google)
Architecture: MoE
Context: 1M+ tokens
Strengths: Multimodal, long context

o1 / o3 (OpenAI)
Architecture: Reasoning-focused
Context: 128K tokens
Strengths: Math, science, hard reasoning
Breakthrough 2024-2025

DeepSeek-V3 (DeepSeek)
Architecture: MoE
Context: 128K tokens
Strengths: Efficiency, cost-effective

The 2024-2025 Paradigm Shift

Old Paradigm (2020-2023)

Better AI = Bigger Model + More Data
  • Pre-training scaling was king
  • Dense architectures dominated
  • Context windows: 2K-8K tokens
  • Single-modality (text only)
  • Fast inference, simple generation

New Paradigm (2024-2025)

Better AI = Smart Architecture + Test-Time Compute + Efficiency
  • Test-time compute scaling matters
  • MoE for efficiency at scale
  • Context windows: 128K-10M tokens
  • Multimodal by default
  • Reasoning-focused generation

The Key Insight

We've learned that how you use compute matters as much as how much compute you have. A smaller model that "thinks longer" can outperform a larger model that answers immediately. This realization has fundamentally changed how we build and deploy LLMs.

Looking Ahead

🔬

Mechanistic Interpretability

Understanding what's happening inside the billions of parameters. Anthropic's goal: detect most AI problems by 2027.

🎯

Specialized Models

Medical LLMs, legal LLMs, coding-specific models. Fine-tuned architectures for specific domains.

📱

Edge Deployment

Running powerful models on phones and laptops. Quantization and distillation enabling local AI.

🤝

Agent Systems

LLMs that use tools, browse the web, write and execute code. Multi-agent collaboration.

In Development

This chapter will provide a comprehensive tour of 2025's most important LLM innovations. We'll explain not just what these techniques are, but why they matter and how they work—building on everything you've learned in previous chapters.

From the mathematical foundations of MoE routing to the engineering breakthroughs enabling 10M token contexts, you'll gain a deep understanding of modern AI systems.
