Question 1

What is MMMU (Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark)?

Accepted Answer

A benchmark of about 11.5K college-level multimodal questions spanning 30 subjects and 183 subfields across six disciplines. It measures how accurately a vision-language model can jointly perceive images (charts, diagrams, maps, tables, etc.) and reason with domain knowledge.

Question 2

What does accuracy mean on MMMU (Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark)?

Accepted Answer

MMMU reports accuracy as a percentage; higher is better. Scores are compared only with other MMMU scores and are never averaged with other benchmarks.
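As a rough illustration, accuracy can be read as the share of questions answered correctly, expressed as a percentage. The sketch below assumes exact-match scoring on the final answer; the function name `accuracy_percent` and the sample answers are hypothetical, and the answer-matching rules of any particular harness may differ.

```python
def accuracy_percent(predictions, answers):
    """Return the percentage of predictions that exactly match the references."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot compute accuracy on an empty question set")
    # Count positions where the predicted answer equals the reference answer.
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


# Hypothetical multiple-choice outputs for four questions (not real MMMU data).
score = accuracy_percent(["A", "C", "B", "D"], ["A", "C", "D", "D"])
print(f"{score:.1f}%")  # 3 of 4 correct -> 75.0%
```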

Question 3

What is the top reported MMMU (Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark) score?

Accepted Answer

GPT-5.1 has the top reported MMMU score: 85.4% accuracy. The leaderboard lists this result as Unverified; the highest Verified score is GPT-5 at 84.2%.

Question 4

Why do MMMU (Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark) scores differ across runs?

Accepted Answer

Harness, scaffold, reasoning effort, and prompt setup change results, so two runs of the same model can differ. evals.report keeps each score with its run context so the differences stay visible.
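One way to picture "keeping each score with its run context" is a record that stores the score next to the settings that produced it. This is only a sketch: the `RunResult` class, its field names, and the sample values are assumptions for illustration, not evals.report's actual schema or real results.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    # All field names are hypothetical; they mirror the factors named above.
    model: str
    benchmark: str
    score_pct: float
    harness: str
    reasoning_effort: str
    prompt_setup: str


def group_by_model(results):
    """Group runs by model without merging or averaging their scores."""
    grouped = defaultdict(list)
    for result in results:
        grouped[result.model].append(result)
    return dict(grouped)


# Made-up placeholder runs of one model under two different setups.
runs = [
    RunResult("model-a", "MMMU", 70.0, "harness-x", "high", "zero-shot"),
    RunResult("model-a", "MMMU", 68.5, "harness-y", "low", "zero-shot"),
]

for run in group_by_model(runs)["model-a"]:
    print(f"{run.score_pct:.1f}%  harness={run.harness}  effort={run.reasoning_effort}")
```

Because each run stays a separate record, two scores for the same model can be shown side by side with the settings that explain the gap, rather than being collapsed into one number.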

Question 5

Does evals.report rank models across benchmarks?

Accepted Answer

No. MMMU scores are shown within their own metric; evals.report never combines benchmarks into a composite ranking or a single "best model".

Model	Lab	Score	Source model	Status	Date
GPT-5.1	OpenAI	85.4%	—	Unverified	Nov 12, 2025
GPT-5	OpenAI	84.2%	—	Verified	Aug 7, 2025
o3	OpenAI	82.9%	—	Verified	Apr 16, 2025
Gemini 2.5 Pro	Google DeepMind	81.7%	—	Verified	Mar 25, 2025
o4-mini	OpenAI	81.6%	—	Verified	Apr 16, 2025
Claude Opus 4.5	Anthropic	80.7%	—	Verified	Nov 24, 2025
Gemini 2.5 Flash	Google DeepMind	79.7%	—	Unverified	Apr 17, 2025
Claude Opus 4.1	Anthropic	77.1%	—	Verified	Aug 5, 2025
Claude Opus 4	Anthropic	76.5%	—	Verified	May 22, 2025
Claude 3.7 Sonnet	Anthropic	75.0%	—	Unverified	Feb 24, 2025
GPT-4.1	OpenAI	74.8%	—	Unverified	Apr 14, 2025
Claude Sonnet 4	Anthropic	74.4%	—	Unverified	May 22, 2025
Llama 4 Maverick (Open)	Meta	73.4%	—	Unverified	Apr 5, 2025
Claude Haiku 4.5	Anthropic	73.2%	—	Verified	Oct 15, 2025
Gemini 2.0 Flash	Google DeepMind	70.7%	—	Unverified	Dec 11, 2024
Llama 4 Scout (Open)	Meta	69.4%	—	Unverified	Apr 5, 2025
GPT-4o	OpenAI	69.1%	—	Verified	May 13, 2024
Claude 3.5 Sonnet	Anthropic	68.3%	—	Unverified	Jun 20, 2024
Gemini 1.5 Pro	Google DeepMind	65.9%	—	Unverified	Feb 15, 2024