Multimodal benchmarks
The multimodal benchmark family
Vision-language models have their own benchmark ecosystem. The major ones in 2026:
| Benchmark | What it tests |
|---|---|
| MMMU | Multi-discipline college-level vision-language reasoning |
| MMMU-Pro | The harder successor |
| MathVista | Math problems with diagrams, charts, geometry |
| ChartQA | Chart and graph comprehension |
| DocVQA | Document understanding (forms, scans, PDFs) |
| OCRBench | Low-level OCR capability |
| VQAv2 | Classic visual question answering (older, mostly saturated) |
| MMVet | Integration of core vision capabilities |
| AI2D | Diagram understanding (grade-school science) |
| MathVerse | Mathematical reasoning with visual input |
MMMU, Massive Multi-discipline Multimodal Understanding
- 11,500 college-exam-level questions across 30 subjects in six disciplines, each with one or more images. Disciplines span art, business, health, humanities, science, and tech.
What it measures. Image + text reasoning at college level.
Saturation. Top models in the 70–80% range (April 2026). MMMU itself is largely saturated for leading models.
MMMU-Pro. 2024 successor. More robust filtering, vision-only variants (no text cues), harder overall. Frontier models in the 60–70% range. Still differentiates.
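For concreteness, MMMU-style benchmarks are scored as multiple-choice accuracy, but models answer in free text, so evaluation harnesses first extract an option letter from the response. A minimal sketch of that step in Python; the extraction regex, field names, and example data are illustrative assumptions, not the official MMMU harness:

```python
import re

def extract_choice(response: str, valid: str = "ABCDE") -> str | None:
    """Pull the first standalone option letter out of a free-form response."""
    match = re.search(rf"\b([{valid}])\b", response.strip())
    return match.group(1) if match else None

def accuracy(items: list[dict]) -> float:
    """Fraction of items where the extracted letter matches the gold answer."""
    correct = sum(extract_choice(it["response"]) == it["answer"] for it in items)
    return correct / len(items)

# Hypothetical responses: two questions, one answered correctly.
items = [
    {"answer": "B", "response": "The diagram shows shear failure, so the answer is B."},
    {"answer": "D", "response": "A"},
]
print(f"MMMU-style accuracy: {accuracy(items):.2%}")  # 50.00%
```

Real harnesses add more aggressive answer cleanup (for example, matching the option text itself when no letter appears), which is one reason reported scores for the same model can differ by a point or two.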
MathVista
- Mathematical reasoning requiring visual understanding: geometry problems, function plots, chart-based math, tables.
What it measures. The intersection of visual-spatial reasoning and arithmetic. Exposes a common blind spot: models can solve an algebra problem stated in text but struggle when the same problem is presented as a geometry diagram.
Current state. Top reasoning models in the 65–80% range. Older non-reasoning models score significantly lower.
ChartQA
Chart-and-graph question answering: bar charts, line charts, pie charts, and “what was the revenue in Q2?”-style questions.
Why it matters. Charts are ubiquitous in business and science, and they’re a known failure mode. A model may read a table of numbers fine but fail to extract numbers from a chart of the same data.
Saturation. Approaching but not fully saturated. Top models 85–90%.
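ChartQA's standard metric is “relaxed accuracy”: a numeric answer counts as correct if it is within 5% of the gold value, while non-numeric answers need an exact match. A rough sketch of the per-answer check (normalization details vary between harnesses):

```python
def relaxed_match(prediction: str, target: str, tolerance: float = 0.05) -> bool:
    """ChartQA-style relaxed accuracy: numeric answers may deviate by up to
    5% of the gold value; everything else needs an exact, case-insensitive match."""
    try:
        pred, gold = float(prediction), float(target)
        if gold == 0:
            return pred == 0
        return abs(pred - gold) / abs(gold) <= tolerance
    except ValueError:
        return prediction.strip().lower() == target.strip().lower()

# Hypothetical examples:
print(relaxed_match("41.9", "42"))  # True: within 5% of 42
print(relaxed_match("39", "42"))    # False: off by ~7%
print(relaxed_match("Q2", "q2"))    # True: exact match after normalization
```

The 5% tolerance means a model can misread a bar's exact height and still score, so a 90% relaxed-accuracy number overstates precision for tasks that need the exact value.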
DocVQA and related document-understanding benchmarks
- DocVQA, question answering over scanned documents.
- InfographicVQA, harder: densely designed infographics mixing text, images, and icons.
- TextVQA, short text within natural images.
These exercise OCR, layout understanding, and reasoning together. Models good at pure OCR (clean text extraction) but bad at layout reasoning fail; so do models with the opposite profile.
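DocVQA and InfographicVQA are scored with ANLS (average normalized Levenshtein similarity) rather than exact match, so near-miss transcriptions earn partial credit. A self-contained sketch of the per-question score, assuming the standard 0.5 threshold (normalization details differ slightly across harnesses):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def anls(prediction: str, answers: list[str], threshold: float = 0.5) -> float:
    """Per-question ANLS: 1 minus the normalized edit distance to the closest
    gold answer, zeroed out when that distance exceeds the threshold."""
    pred = prediction.strip().lower()
    best = 0.0
    for gold in answers:
        gold = gold.strip().lower()
        dist = levenshtein(pred, gold) / max(len(pred), len(gold), 1)
        if dist <= threshold:
            best = max(best, 1.0 - dist)
    return best

# Hypothetical example: an OCR near-miss still earns partial credit.
print(anls("lnvoice #1042", ["Invoice #1042"]))  # ~0.92
```

The benchmark score is the mean of this per-question value, which is why a model with sloppy but mostly-correct transcription can still post a high DocVQA number.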
OCRBench
Tests OCR capability in isolation: transcription of text from images, including unusual fonts, rotated text, mathematical notation, and multiple languages.
Why it matters. A prerequisite for DocVQA. Weak OCR guarantees weak DocVQA.
MMVet, MMBench
Broader vision-language evaluation frameworks aggregating multiple capabilities (recognition, OCR, knowledge, math, spatial reasoning). Used for comprehensive model cards rather than single-number comparisons.
Video benchmarks
- Video-MME, comprehensive video understanding (short to long clips).
- MVBench, 20 video tasks spanning action, scene, object, and attribute understanding.
- LongVideoBench, long-form video question answering (hour-plus videos).
Video benchmarks lag image benchmarks: the models are weaker, the benchmarks are less mature, and compute costs are prohibitive.
Audio benchmarks
- AudioBench, audio question answering.
- MMAU, massive multi-task audio understanding and reasoning.
Still early in development. Most “multimodal” claims in 2026 are primarily vision-language; audio is playing catch-up.
Embodied / spatial benchmarks
- SpatialBench, MindCube, 3D and embodied spatial reasoning.
- RoboBench, robot-task completion from visual input.
Emerging category. Frontier models struggle; benchmark-gaming is less of a concern because they’re hard to saturate.
What multimodal benchmarks don’t measure
- Diagram generation. Most benchmarks test understanding; creating clean diagrams is an orthogonal skill.
- Interactive manipulation. Benchmarks mostly use static images; real use involves screenshots that change.
- Video generation quality. Video-MME tests understanding, not generation.
- Cross-modal reasoning at scale. Combining image + audio + text in one task is barely benchmarked.
- Grounded interaction. Pointing to and clicking specific UI elements (“point to the button”) is barely tested.
Reading multimodal scores
Pure vision vs VQA
A model may ace VQAv2 (classic visual QA) and fail MMMU. The former is closer to “object recognition”; the latter is “reason about what you see.” Different capabilities.
Text-in-image leakage
Some “vision” tasks become text tasks if the model does OCR and then reasons from the extracted text. MMMU-Pro includes vision-only variants specifically to catch this.
Context-length interactions
Some multimodal benchmarks now include long-form documents with many pages. Scores here couple vision ability with long-context ability.
Reasoning-mode multiplier
Same as every other benchmark: reasoning-mode variants score ~15–25 points higher on hard multimodal reasoning.
The current state of the art (April 2026)
Frontier multimodal models (GPT-5.x vision, Gemini 3.x, Claude 4.x vision):
- MMMU ~80%, closing on saturation.
- MMMU-Pro ~65%, still differentiates.
- MathVista ~75% with reasoning.
- ChartQA ~90%, nearly saturated.
- DocVQA ~93%, nearly saturated.
The frontier is shifting toward:
- Longer videos (an hour+ of footage).
- Agentic multimodal, a model using a browser with screenshots in the loop.
- High-resolution technical diagrams (engineering drawings, scientific figures).
- 3D / spatial.
References
- MMMU, Yue et al., 2023
- MMMU-Pro, Yue et al., 2024
- MathVista, Lu et al., 2023
- ChartQA, Masry et al., 2022
- DocVQA, Mathew et al., 2021
- OCRBench, Liu et al., 2023
- MMBench, Liu et al., 2023
- Video-MME, Fu et al., 2024
- LongVideoBench, Wu et al., 2024
- OpenCompass multimodal leaderboards
Related topics
- Knowledge and reasoning benchmarks, text-only counterpart
- Agent benchmarks, where multimodal matters for computer-use
- RAG, increasingly multimodal (retrieve across images + text)