evals.report

MMMU (Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark)

A benchmark of ~11.5K college-level multimodal questions spanning 30 subjects and 183 subfields across six disciplines, measuring a vision-language model's accuracy at jointly perceiving images (charts, diagrams, maps, tables, etc.) and reasoning with domain knowledge.

Multimodal · accuracy · Higher is better
Model              Lab              Score  Status      Date
GPT-5.1            OpenAI           85.4%  Unverified  Nov 12, 2025
GPT-5              OpenAI           84.2%  Verified    Aug 7, 2025
o3                 OpenAI           82.9%  Verified    Apr 16, 2025
Gemini 2.5 Pro     Google DeepMind  81.7%  Verified    Mar 25, 2025
o4-mini            OpenAI           81.6%  Verified    Apr 16, 2025
Claude Opus 4.5    Anthropic        80.7%  Verified    Nov 24, 2025
Gemini 2.5 Flash   Google DeepMind  79.7%  Unverified  Apr 17, 2025
Claude Opus 4.1    Anthropic        77.1%  Verified    Aug 5, 2025
Claude Opus 4      Anthropic        76.5%  Verified    May 22, 2025
Claude 3.7 Sonnet  Anthropic        75.0%  Unverified  Feb 24, 2025
GPT-4.1            OpenAI           74.8%  Unverified  Apr 14, 2025
Claude Sonnet 4    Anthropic        74.4%  Unverified  May 22, 2025
Llama 4 Maverick   Meta             73.4%  Unverified  Apr 5, 2025
Claude Haiku 4.5   Anthropic        73.2%  Verified    Oct 15, 2025
Gemini 2.0 Flash   Google DeepMind  70.7%  Unverified  Dec 11, 2024
Llama 4 Scout      Meta             69.4%  Unverified  Apr 5, 2025
GPT-4o             OpenAI           69.1%  Verified    May 13, 2024
Claude 3.5 Sonnet  Anthropic        68.3%  Unverified  Jun 20, 2024
Gemini 1.5 Pro     Google DeepMind  65.9%  Unverified  Feb 15, 2024

Each row reports the model’s accuracy on MMMU.
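
As a rough illustration of what an accuracy score like the ones above represents, the sketch below computes the percentage of questions answered correctly. It assumes simple case-insensitive exact matching between a model's final answer and the reference answer; the function name `accuracy_percent` and the toy data are hypothetical, and the actual MMMU scoring procedure (including how open-ended answers are handled) may differ.

```python
def accuracy_percent(predictions, answers):
    """Return the share of predictions matching the reference answers, as a percentage.

    Assumption: answers are compared by case-insensitive exact match after
    trimming whitespace. This is an illustrative rule, not the official scorer.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot score an empty set of questions")
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return 100.0 * correct / len(answers)


# Hypothetical toy data: four multiple-choice questions, three answered correctly.
score = accuracy_percent(["A", "c", "B", "D"], ["A", "C", "B", "A"])
print(f"{score:.1f}%")  # prints "75.0%"
```

Under this reading, a score of 85.4% would mean that roughly 85 of every 100 evaluated questions were answered correctly.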