
Evaluation & Benchmarks

How we measure LLM quality—and why it's harder than you think

The Measurement Challenge

How do you know if GPT-4 is "better" than Claude? Or if your fine-tuned model actually improved?

Unlike traditional software (where tests either pass or fail), LLMs are probabilistic. The same model can give different answers to the same question. Quality is subjective. So how do we measure progress?

The answer: Benchmarks — standardized tests that measure specific capabilities. But as you'll learn, benchmarks have serious problems.

What You'll Learn

This chapter covers how the AI community measures model quality, the major benchmarks everyone uses, and the growing crisis in evaluation.

01

Why Evaluation Matters

Understanding the importance of measuring AI capabilities

  • Research: Proving your model is better than previous ones
  • Production: Ensuring your LLM doesn't regress after updates
  • Selection: Choosing between GPT-4, Claude, or LLaMA for your use case
  • Fine-tuning: Validating that your training improved the model
02

Major LLM Benchmarks (2025)

The tests everyone uses to compare models

  • MMLU: 57 subjects, multiple choice, tests world knowledge
  • HumanEval: Code generation (Python functions)
  • GPQA: Graduate-level science questions (PhD-hard)
  • DROP: Reading comprehension with math reasoning
  • HellaSwag: Common sense reasoning
  • TruthfulQA: Avoiding falsehoods and misconceptions
  • ARC-AGI: Abstract visual reasoning puzzles designed to resist memorization (a fluid-intelligence test)
MMLU Example:
Q: What is the primary function of mitochondria?
A) Photosynthesis B) Energy production C) DNA storage D) Protein synthesis
HumanEval Example:
Write a function that returns the nth Fibonacci number
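To make the HumanEval setup concrete, here is a minimal sketch (not the official harness) of how such a task is scored: the model's completion is appended to the prompt, executed, and counted as solved only if every hidden unit test passes. The candidate_completion string below stands in for a model's output; the real benchmark runs completions in a sandboxed process with a timeout.

# Sketch of HumanEval-style functional-correctness scoring (illustrative only).
PROMPT = '''def fib(n: int) -> int:
    """Return the nth Fibonacci number (fib(0) = 0, fib(1) = 1)."""
'''

# Stand-in for a model completion; the harness would sample this from the model.
candidate_completion = """    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a
"""

def check(fib) -> bool:
    # Hidden unit tests: the task counts as solved only if all assertions pass.
    assert fib(0) == 0
    assert fib(1) == 1
    assert fib(10) == 55
    return True

namespace = {}
exec(PROMPT + candidate_completion, namespace)   # run prompt + completion together
try:
    passed = check(namespace["fib"])
except AssertionError:
    passed = False
print("passed" if passed else "failed")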
03

Benchmark Scores: What They Mean

Interpreting the numbers and understanding model capabilities

  • MMLU 90%: Approaching expert-level knowledge
  • HumanEval 85%: Can solve most standard coding problems
  • GPQA 60%: PhD-level reasoning (domain experts score ~65%, skilled non-experts ~34%)
  • Why percentages alone don't tell the full story (see the pass@k sketch below)
Model         MMLU   HumanEval   GPQA
GPT-4         86%    67%         50%
Claude 3.5    88%    92%         59%
o3 (2025)     92%    85%         88%
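Coding scores in tables like this are usually reported as pass@k: the probability that at least one of k sampled completions passes the unit tests. The unbiased estimator introduced alongside HumanEval, given n samples per problem of which c passed, is 1 - C(n-c, k) / C(n, k); a small sketch:

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimate of P(at least one of k samples passes),
    # given n samples drawn and c of them passing the tests.
    if n - c < k:
        return 1.0  # every size-k draw must contain at least one passing sample
    # Numerically stable form of 1 - C(n - c, k) / C(n, k)
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples per problem, 140 passed
print(round(pass_at_k(200, 140, 1), 3))   # 0.7  (same as the plain pass rate)
print(round(pass_at_k(200, 140, 10), 3))  # ~1.0 (almost surely one of 10 passes)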
04

The Benchmark Crisis

Why traditional benchmarks are breaking down

  • Contamination: Training data includes benchmark questions (see the overlap sketch below)
  • Saturation: Models score 95%+ (ceiling effect)
  • Gaming: Optimizing for benchmarks != real capability
  • Brittleness: Small prompt changes = huge score drops
  • Why we need new evaluation methods
⚠️ 2020: GPT-3 scores 45% on MMLU → "Impressive!"
📈 2023: GPT-4 scores 86% → "Near expert!"
🤔 2025: Most models score 90%+ → "Now what?"
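A crude but common first pass at the contamination problem flagged above is to look for long n-gram overlaps between benchmark items and the training corpus (the GPT-3 paper used 13-gram matching in a similar spirit). Real contamination analyses are more careful; this is only an illustrative sketch with placeholder data.

def ngrams(text: str, n: int = 13) -> set:
    # Lowercased word n-grams; real pipelines normalize more aggressively.
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(benchmark_item: str, train_ngrams: set, n: int = 13) -> bool:
    # Flag the item if it shares at least one long n-gram with the training data.
    return not ngrams(benchmark_item, n).isdisjoint(train_ngrams)

# Toy placeholders; in practice these come from your corpus and benchmark files.
training_documents = ["...full text of each training document..."]
test_items = ["What is the primary function of mitochondria? A) ... D) ..."]

# Build the training-side index once, then screen every test item.
train_index = set()
for doc in training_documents:
    train_index |= ngrams(doc)

flagged = [item for item in test_items if is_contaminated(item, train_index)]
print(f"{len(flagged)} / {len(test_items)} items possibly contaminated")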
05

Agent Benchmarks: The New Frontier

Testing multi-step reasoning and real-world tasks

  • SWE-bench: Real GitHub issues—can agents fix bugs?
  • WebArena: Navigate real websites to complete tasks
  • AgentBench: Multi-environment agent evaluation
  • GAIA: Real-world questions requiring tool use
  • Why these are harder to game than traditional benchmarks
06

Human Evaluation: The Gold Standard

When humans judge quality directly

  • Chatbot Arena: Users vote on which response is better; votes are aggregated into Elo-style ratings (sketch below)
  • Head-to-head comparisons: A vs B blind tests
  • Limitations: Expensive, slow, subjective
  • But necessary: For tasks like creativity, helpfulness, style
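Chatbot Arena's ratings started as an online Elo system (it has since moved to a closely related Bradley-Terry fit); the core idea is that after each blind vote the winner takes rating points from the loser in proportion to how surprising the win was. A minimal sketch:

def expected_score(r_a: float, r_b: float) -> float:
    # Probability that model A beats model B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    # Return updated (r_a, r_b) after one head-to-head vote.
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Both models start at 1000; model A wins three votes in a row.
ra, rb = 1000.0, 1000.0
for _ in range(3):
    ra, rb = elo_update(ra, rb, a_won=True)
print(round(ra), round(rb))   # A climbs above 1000, B drops by the same amount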
07

Domain-Specific Evaluation

Measuring performance on specialized tasks

  • Medical: MedQA, PubMedQA (clinical knowledge)
  • Legal: LegalBench (contracts, case law)
  • Code: MBPP, CodeForces ratings
  • Math: MATH, GSM8K (grade school to competition level)
  • Creating custom benchmarks for your domain
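For that last bullet, a custom benchmark can be as small as a JSONL file of prompt/answer pairs plus a scoring loop. The sketch below assumes short, unambiguous answers where exact match is meaningful; for free-form outputs you would need a softer metric or a judge. generate() is a placeholder for whatever model call you use, and the file name is hypothetical.

import json

def exact_match(prediction: str, reference: str) -> bool:
    # Case- and whitespace-insensitive string match; fine for short factual answers.
    return prediction.strip().lower() == reference.strip().lower()

def evaluate(generate, test_path: str) -> float:
    # generate(prompt) -> str; each JSONL line looks like {"prompt": ..., "answer": ...}
    correct = total = 0
    with open(test_path) as f:
        for line in f:
            example = json.loads(line)
            correct += exact_match(generate(example["prompt"]), example["answer"])
            total += 1
    return correct / total

# accuracy = evaluate(my_model_generate, "clinical_qa_test.jsonl")
# print(f"accuracy: {accuracy:.1%}")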
08

LLM-as-Judge: Using AI to Evaluate AI

Can we use GPT-4 to evaluate other models?

  • The idea: Have a strong LLM judge weaker LLMs' outputs
  • Pros: Cheap, fast, scalable
  • Cons: Position, verbosity, and self-preference biases; can be gamed
  • When it works: Open-ended tasks (summarization, creative writing)
  • When it fails: Factual accuracy, math, code correctness
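A minimal sketch of pairwise LLM-as-judge, assuming an OpenAI-compatible Python client and a hypothetical choice of judge model. Judges tend to prefer whichever answer appears first, so each comparison is run twice with the order swapped and only consistent verdicts count as a win.

from openai import OpenAI   # assumes the openai package is installed and an API key is configured

client = OpenAI()

JUDGE_PROMPT = """You are judging two answers to the same question.
Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}

Which answer is better? Reply with exactly one letter: A or B."""

def judge_once(question: str, answer_a: str, answer_b: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",   # example judge; any strong model works
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer_a=answer_a, answer_b=answer_b)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

def pairwise_verdict(question: str, output_1: str, output_2: str) -> str:
    # Judge both orderings to control for position bias.
    first = judge_once(question, output_1, output_2)    # model 1 shown as Answer A
    second = judge_once(question, output_2, output_1)   # order swapped
    if first == "A" and second == "B":
        return "model_1"
    if first == "B" and second == "A":
        return "model_2"
    return "tie"   # inconsistent or malformed verdicts are treated as a tie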

Practical Evaluation Strategies

📊

For Researchers

  • Report multiple benchmarks (not just one)
  • Include contamination analysis
  • Show per-category breakdowns
  • Compare to baselines, not just leaderboard top
🏭

For Production

  • Create task-specific test sets (100-1000 examples)
  • Monitor regression: Does the new version perform worse? (see the gate sketch below)
  • A/B test with real users
  • Track business metrics (user satisfaction, task success rate)
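A sketch of the regression check, assuming you already have a task-specific scorer like the evaluate() helper above and callable wrappers around the current and candidate model versions: score both on the same frozen test set and block the rollout if the candidate drops by more than a small tolerance.

def regression_gate(evaluate, current_generate, candidate_generate,
                    test_path: str, tolerance: float = 0.02) -> bool:
    # Returns True if the candidate is safe to ship on this test set.
    current_score = evaluate(current_generate, test_path)
    candidate_score = evaluate(candidate_generate, test_path)
    print(f"current: {current_score:.1%}   candidate: {candidate_score:.1%}")
    return candidate_score >= current_score - tolerance

# if not regression_gate(evaluate, current_model, candidate_model, "prod_tests.jsonl"):
#     raise RuntimeError("candidate model regressed on the production test set")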
🎯

For Model Selection

  • Ignore marketing: Test on YOUR tasks
  • Create 50-100 representative examples
  • Run all candidate models
  • Compare outputs blind (no model names)
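One way to keep that comparison blind, sketched under the assumption that each candidate model is wrapped as a generate(prompt) callable: collect every model's output per example, strip the model names, and shuffle the order before showing anything to reviewers; keep a separate answer key for after the votes are in.

import random

def blind_comparison(prompts, models):
    # models: dict mapping model name -> generate(prompt) callable.
    records, answer_key = [], []
    for i, prompt in enumerate(prompts):
        outputs = [(name, generate(prompt)) for name, generate in models.items()]
        random.shuffle(outputs)                     # hide which model produced which answer
        labels = [chr(ord("A") + j) for j in range(len(outputs))]
        records.append({"id": i, "prompt": prompt,
                        "candidates": {label: text for label, (_, text) in zip(labels, outputs)}})
        answer_key.append({label: name for label, (name, _) in zip(labels, outputs)})
    return records, answer_key   # show records to reviewers; reveal answer_key only afterwards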
🔬

For Fine-tuning

  • Hold out 20% of data for validation
  • Track loss curves during training
  • Evaluate on held-out set after each epoch
  • Compare to base model on same test set
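The held-out split itself is simple bookkeeping; the sketch below just makes it deterministic so the validation set stays identical across runs, which is what lets you compare epochs (and the base model) fairly on the same examples. The training loop is framework-specific and omitted.

import random

def train_val_split(examples: list, val_fraction: float = 0.2, seed: int = 42):
    # Shuffle with a fixed seed, then carve off the validation slice once, before training.
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1.0 - val_fraction))
    return shuffled[:cut], shuffled[cut:]

# train_set, val_set = train_val_split(all_examples)
# After each epoch: compute validation loss/accuracy on val_set;
# when training is done, evaluate both the fine-tuned and the base model on the same val_set.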

The Future of Evaluation

🧪 Adversarial Testing

Dynamically generated tests that adapt to model weaknesses (e.g., red teaming)

🌐 Real-World Tasks

Benchmarks based on actual user requests, not academic datasets

🔄 Continuous Evaluation

Always-updating benchmarks to prevent contamination and saturation

🎭 Multi-Modal Assessment

Evaluating vision, audio, video understanding—not just text

Key Takeaways

1. Benchmarks ≠ Real Performance

High scores on MMLU don't guarantee your model will be good at YOUR task.

2. Test on Your Data

The best evaluation is always task-specific. Create your own test set.

3. Multiple Metrics Matter

One number can't capture model quality. Look at diverse capabilities.

4. Evaluation is Evolving

As models improve, we need harder tests. What worked in 2020 doesn't work in 2025.

Benchmark Resources

🏆 Leaderboards

  • HuggingFace Open LLM Leaderboard
  • Chatbot Arena (LMSYS)
  • Papers With Code

📚 Benchmark Datasets

  • MMLU, HellaSwag, ARC on HuggingFace
  • HumanEval on GitHub
  • BIG-bench (Google)

🛠️ Evaluation Tools

  • lm-evaluation-harness (EleutherAI)
  • OpenAI Evals framework
  • LangSmith (LangChain)

Coming Soon!

This chapter will include interactive demos where you can evaluate different models on the same tasks, create your own benchmarks, and understand why evaluation is both an art and a science.
