Multimodal benchmarks

The multimodal benchmark family

Vision-language models have their own benchmark ecosystem. The major ones in 2026:

Benchmark      What it tests
MMMU           Multi-discipline college-level vision-language reasoning
MMMU-Pro       The harder successor to MMMU
MathVista      Math problems with diagrams, charts, geometry
ChartQA        Chart and graph comprehension
DocVQA         Document understanding (forms, scans, PDFs)
OCRBench       Low-level OCR capability
VQAv2          Classic visual question answering (older, mostly saturated)
MMVet          Integration of core vision capabilities
AI2D           Diagram understanding (grade-school science)
MathVerse      Mathematical reasoning with visual input

MMMU, Massive Multi-discipline Multimodal Understanding

What it is. 11,500 college-exam-level questions, each with one or more images, spanning 30 subjects across six core disciplines: art, business, health, humanities, science, and tech.

What it measures. Image + text reasoning at college level.

Saturation. Top models in the 70–80% range (April 2026). MMMU itself is largely saturated for leading models.

MMMU-Pro. The 2024 successor: filters out questions answerable without the image, adds vision-only variants (the question is embedded in the image, so there are no text cues), and is harder overall. Frontier models score in the 60–70% range, so it still differentiates.
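
For concreteness, here is a minimal sketch of an MMMU-style multiple-choice accuracy loop. It assumes the dataset is published on Hugging Face under the MMMU/MMMU id with per-subject configs and image_1/question/options/answer fields, and model_answer is a hypothetical wrapper around your own model; treat all of these as assumptions, not the official harness.

    # Minimal MMMU-style accuracy loop (sketch, not the official harness).
    # Assumed: Hugging Face dataset id "MMMU/MMMU" with per-subject configs
    # and fields image_1 / question / options / answer.
    from datasets import load_dataset

    def model_answer(image, question, options):
        # Hypothetical wrapper: call your vision-language model here and
        # return a single option letter such as "A".
        raise NotImplementedError

    def mmmu_accuracy(subject="Art", split="validation"):
        ds = load_dataset("MMMU/MMMU", subject, split=split)
        correct = sum(
            model_answer(row["image_1"], row["question"], row["options"]) == row["answer"]
            for row in ds
        )
        return correct / len(ds)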

MathVista

What it is. Mathematical reasoning requiring visual understanding: geometry problems, function plots, chart-based math, tables.

What it measures. The intersection of visual-spatial reasoning and arithmetic. Exposes a common blind spot: models can solve an algebra problem, but struggle when the same problem is presented as a geometry diagram.

Current state. Top reasoning models 65–80% range. Older non-reasoning models significantly lower.

ChartQA

Chart-and-graph question answering: bar charts, line charts, pie charts, and “what was the revenue in Q2?”-style questions.

Why it matters. Charts are ubiquitous in business and science, and they’re a known failure mode. A model may read a table of numbers fine but fail to extract numbers from a chart of the same data.

Saturation. Approaching but not fully saturated. Top models 85–90%.
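
ChartQA is conventionally scored with “relaxed accuracy”: numeric answers count as correct within a 5% relative tolerance, everything else requires an exact match. A minimal sketch of that metric follows; the function names and normalization choices are illustrative, not the official scorer.

    # Sketch of ChartQA-style "relaxed accuracy": numeric answers are
    # correct within a 5% relative tolerance; other answers must match
    # exactly (case-insensitive). Illustrative, not the official scorer.
    def relaxed_match(prediction: str, target: str, tol: float = 0.05) -> bool:
        try:
            pred, gold = float(prediction), float(target)
            if gold == 0:
                return pred == 0
            return abs(pred - gold) / abs(gold) <= tol
        except ValueError:
            return prediction.strip().lower() == target.strip().lower()

    def relaxed_accuracy(preds, golds):
        return sum(relaxed_match(p, g) for p, g in zip(preds, golds)) / len(golds)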

DocVQA, InfographicVQA, TextVQA

  • DocVQA, question answering over scanned documents.
  • InfographicVQA, harder; complex designed infographics mixing text, images, and icons.
  • TextVQA, reading short text embedded in natural images.

These exercise OCR, layout, and reasoning together. Models that are good at pure OCR (clean text extraction) but bad at layout reasoning fail here; so do models with the reverse profile.
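
DocVQA-family benchmarks are typically scored with ANLS (Average Normalized Levenshtein Similarity), which gives partial credit for near-miss transcriptions and zeroes out anything below a 0.5 similarity threshold. A rough self-contained sketch (the pure-Python edit distance is for clarity; the official evaluator differs in details):

    # Sketch of ANLS, the standard DocVQA metric. Each question takes the
    # best match over its ground truths; similarity below the threshold
    # scores zero.
    def levenshtein(a: str, b: str) -> int:
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(
                    prev[j] + 1,                # deletion
                    cur[j - 1] + 1,             # insertion
                    prev[j - 1] + (ca != cb),   # substitution
                ))
            prev = cur
        return prev[-1]

    def anls(predictions, ground_truths, threshold=0.5):
        # predictions: list[str]; ground_truths: list[list[str]]
        total = 0.0
        for pred, golds in zip(predictions, ground_truths):
            best = 0.0
            for gold in golds:
                d = levenshtein(pred.lower().strip(), gold.lower().strip())
                nl = d / max(len(pred), len(gold), 1)
                best = max(best, 1 - nl if nl < threshold else 0.0)
            total += best
        return total / len(predictions)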

OCRBench

Tests OCR capability in isolation: transcription of text from images, including unusual fonts, rotated text, mathematical notation, and multiple languages.

Why it matters. A prerequisite for DocVQA. Weak OCR guarantees weak DocVQA.
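
Scoring for this kind of benchmark is simple; OCRBench-style evaluation commonly counts a response as correct if the ground-truth string appears in the model output after light normalization. A sketch, with the normalization details as an assumption rather than the official script:

    # OCRBench-style containment scoring (sketch). The normalization here
    # (lowercase, drop whitespace) is an assumption, not the official script.
    def ocr_match(response: str, answer: str) -> bool:
        norm = lambda s: "".join(s.lower().split())
        return norm(answer) in norm(response)

    def ocr_score(responses, answers):
        return sum(ocr_match(r, a) for r, a in zip(responses, answers)) / len(answers)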

MMVet, MMBench

Broader vision-language evaluation frameworks aggregating multiple capabilities (recognition, OCR, knowledge, math, spatial reasoning). Used for comprehensive model cards rather than single-number comparisons.

Video benchmarks

  • Video-MME, comprehensive video understanding (short to long clips).
  • MVBench, 20 video tasks spanning action, scene, object, and attribute understanding.
  • LongVideoBench, long-form video question answering (hour-plus videos).

Video benchmarks lag image benchmarks: the models are weaker, the benchmarks are less mature, and compute costs are prohibitive.

Audio benchmarks

  • AudioBench, audio question answering.
  • MMAU, multimodal audio understanding.

Still early in development. Most “multimodal” claims in 2026 are primarily vision-language; audio is playing catch-up.

Embodied / spatial benchmarks

  • SpatialBench, MindCube, 3D and embodied spatial reasoning.
  • RoboBench, robot-task completion from visual input.

Emerging category. Frontier models struggle; benchmark-gaming is less of a concern because they’re hard to saturate.

What multimodal benchmarks don’t measure

  • Diagram generation. Most benchmarks test understanding; creating clean diagrams is an orthogonal skill.
  • Interactive manipulation. Benchmarks mostly use static images; real use involves screenshots that change.
  • Video generation quality. Video-MME tests understanding, not generation.
  • Cross-modal reasoning at scale. Combining image + audio + text in one task is barely benchmarked.
  • Grounded interaction. Pointing and localization (“point to the button”) are barely tested.

Reading multimodal scores

Pure vision vs VQA

A model may ace VQAv2 (classic visual QA) and fail MMMU. The former is closer to “object recognition”; the latter is “reason about what you see.” Different capabilities.

Text-in-image leakage

Some “vision” tasks become text tasks if the model does OCR and then reasons from the extracted text. MMMU-Pro includes vision-only variants specifically to catch this.
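
One way to quantify this leakage is to compare the full vision-language model against a text-only pipeline that sees nothing but OCR output; if the gap is small, the benchmark is mostly a text task in disguise. A sketch, where ocr, text_llm, and vlm_answer are all hypothetical stand-ins for your own components:

    # Leakage probe (sketch): if an OCR -> text-only pipeline scores close
    # to the full VLM, the "vision" benchmark leaks through extracted text.
    # ocr, text_llm, and vlm_answer are hypothetical user-supplied callables.
    def leakage_gap(dataset, vlm_answer, ocr, text_llm):
        vlm_correct = text_correct = 0
        for row in dataset:
            vlm_correct += vlm_answer(row["image"], row["question"]) == row["answer"]
            text_only = text_llm(ocr(row["image"]) + "\n" + row["question"])
            text_correct += text_only == row["answer"]
        n = len(dataset)
        return (vlm_correct - text_correct) / n  # small gap => heavy leakage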

Context-length interactions

Some multimodal benchmarks now include long-form documents with many pages. Scores here couple vision ability with long-context ability.

Reasoning-mode multiplier

Same as every other benchmark: reasoning-mode variants score ~15–25 points higher on hard multimodal reasoning.

The current state of the art (April 2026)

Frontier multimodal models (GPT-5.x vision, Gemini 3.x, Claude 4.x vision):

  • MMMU ~80%, closing on saturation.
  • MMMU-Pro ~65%, still differentiates.
  • MathVista ~75% with reasoning.
  • ChartQA ~90%, nearly saturated.
  • DocVQA ~93%, nearly saturated.

The frontier is shifting toward:

  • Longer videos (an hour+ of footage).
  • Agentic multimodal use, a model driving a browser with screenshots in the loop.
  • High-resolution technical diagrams (engineering drawings, scientific figures).
  • 3D / spatial.
