LLM reasoning benchmarks and metrics

Why benchmarks

Every generation of LLMs arrives with charts. New model X scores N on benchmark Y; therefore, new model X is the best. The trouble is benchmarks measure specific, narrow things. A model can top MMLU-Pro and still be a poor coding agent; a model can ace SWE-bench Verified and still fail at a reasoning task a smart undergrad solves in seconds.

This series is a map of the benchmark landscape in 2026. What each one measures, how close it is to saturation, how it’s been gamed, and what a given score does and doesn’t tell you.

The series

Overview, this page.
Knowledge and reasoning, MMLU, MMLU-Pro, GPQA Diamond, Humanity’s Last Exam, ARC-AGI / ARC-AGI-2, HellaSwag, TruthfulQA.
Math benchmarks, GSM8K, MATH, AIME, FrontierMath, OlympiadBench.
Coding benchmarks, HumanEval, MBPP, LiveCodeBench, APPS, CodeContests, SWE-bench and SWE-bench Verified / Pro.
Agent benchmarks, Terminal-Bench, TAU-bench, OSWorld, WebArena, GAIA, BrowseComp.
Long-context benchmarks, NIAH, RULER, LongBench, U-NIAH.
Multimodal benchmarks, MMMU, MathVista, ChartQA, DocVQA.
Evaluation methodology and metrics, LMArena, LLM-as-judge, pass@k, contamination detection, golden sets, preference elicitation.

How benchmarks go wrong

Four failure modes to have in your head before reading any leaderboard:

1. Saturation

When state-of-the-art models consistently score above ~90%, the benchmark stops differentiating. MMLU was saturated by 2023 (85%+ on GPT-4), which is why MMLU-Pro was created. MMLU-Pro is itself approaching saturation in 2026, Gemini 3 Pro at ~90.1%, Claude Opus 4.5 at ~89.5%.

2. Contamination

Benchmarks are released publicly. Their solutions end up in training data. Next-generation models “score higher” partly because they’ve seen the answers. OpenAI has confirmed training-data leakage on SWE-bench Verified across every frontier model. This is why benchmarks like LiveCodeBench, which draws fresh problems after known training cutoffs, have become the more trusted coding signal.

3. Reward hacking

Agents optimize for what’s measured. The RDI Berkeley blog “How We Broke Top AI Agent Benchmarks” shows how nearly every major agent benchmark can be scored near 100% by an agent that exploits environment bugs, file-system tricks, or success-detection heuristics instead of actually completing the task.

4. Contamination-by-proxy

A benchmark might not be in training data verbatim, but structurally similar content is. A model trained on every competitive-programming problem from 2015–2023 can “solve” LeetCode-style benchmarks by recognizing the template, not by reasoning.

What to read a benchmark score as

An upper bound on specific capability. A high MMLU-Pro score doesn’t mean the model is smart; it means it can answer multiple-choice questions in the MMLU-Pro format.
Evidence of a training regime. Specific benchmarks respond to specific training investments (RLHF, long-context training, tool-use fine-tuning). High score = investment in that capability.
A contamination-sensitive proxy. Newer benchmarks are usually more trustworthy. Old benchmarks near saturation are essentially training-data memorization tests.
Not a user-facing metric. LMArena preference scores track how users feel about models, which can diverge wildly from benchmark performance.

The frontier benchmarks worth watching in 2026

A shortlist if you’re tracking frontier progress:

Benchmark	What it measures	Saturation state (Apr 2026)
Humanity’s Last Exam (HLE)	Expert-level knowledge across PhD-grade subjects	~35% (far from saturated)
ARC-AGI-2	Fluid, visual-pattern reasoning	~85% top; human avg 60%
FrontierMath	Research-level math problems	Single-digit to mid-teens
SWE-bench Pro / Verified	Real-world software engineering	~80–90% top
Terminal-Bench	Autonomous terminal/sysadmin	~77% top
LiveCodeBench	Contamination-resistant coding	Rolling; frontier ~60–70%
GPQA Diamond	Graduate-level reasoning across STEM	~92% top (approaching saturation)
MMLU-Pro	Multi-domain knowledge, harder than MMLU	~90% top (approaching saturation)
OSWorld / WebArena	Computer / browser use by agents	Low 30–50% (hard)

You’ll see different numbers on different leaderboards, the details depend on prompt format, tool use, reasoning budget, and whether the model is run in “thinking” mode. Treat all specific numbers as approximate; the ordering matters more than the percentage.

A note on this series

The benchmark landscape moves fast. What’s frontier in April 2026 will be historical by year-end. This series focuses on the structure of each benchmark, what it measures, how it’s administered, what it’s vulnerable to, so you can read future leaderboards intelligently, not memorize current scores.

References

Chatbot Arena, the de facto human-preference leaderboard
Artificial Analysis, aggregated benchmark leaderboards
HELM (Stanford CRFM), holistic evaluation framework
OpenCompass, comprehensive LLM benchmarking
Epoch AI benchmarks, frontier-model tracking
lmsys, Chatbot Arena paper
Scale AI, SEAL leaderboards, private, contamination-resistant
RDI Berkeley, how we broke the benchmarks

AI Harness Development, how benchmarks get wired into agents
RAG, a separate capability with its own benchmark family
AI Coding Tool Blindspots, what even top-scoring models still miss