evals.report

MMMU-Pro

MMMU-Pro is the harder variant of the MMMU multimodal reasoning benchmark: college-level subject tasks that combine text and images. It is the variant that current frontier models report.

Multimodal · accuracy · Higher is better
| Model | Lab | Score | Source model | Status | Date |
| --- | --- | --- | --- | --- | --- |
| GPT-5.4 | OpenAI | 82.1% | GPT-5.4 Thinking w/ tools | Official | Apr 8, 2026 |
| Gemini 3 Pro | Google DeepMind | 81.0% | Gemini 3.0 Pro | Official | Apr 8, 2026 |
| Gemini 3.1 Pro Preview | Google DeepMind | 80.5% | Gemini 3.1 Pro Thinking (High) | Official | Apr 8, 2026 |
| Muse Spark | Meta | 80.4% | Muse Spark Thinking | Official | Apr 8, 2026 |
| GPT-5.2 | OpenAI | 80.4% | GPT-5.2 Thinking w/o Python | Official | Apr 8, 2026 |
| GPT-5.1 | OpenAI | 79.0% | GPT-5.1 Thinking | Official | Apr 8, 2026 |
| GPT-5 high | OpenAI | 78.4% | GPT-5 w/ thinking | Official | Apr 8, 2026 |
| Claude Opus 4.6 | Anthropic | 77.3% | Claude Opus 4.6 w/ tools | Official | Apr 8, 2026 |
| o3 | OpenAI | 76.4% | o3 | Official | Apr 8, 2026 |
| Claude Sonnet 4.6 | Anthropic | 75.6% | Claude Sonnet 4.6 w/ tools | Official | Apr 8, 2026 |
| Claude Opus 4.5 | Anthropic | 73.9% | Claude Opus 4.5 | Official | Apr 8, 2026 |
| Claude Sonnet 4.5 | Anthropic | 68.9% | Claude Sonnet 4.5 | Official | Apr 8, 2026 |
| Gemini 2.5 Pro | Google DeepMind | 68.0% | Gemini 2.5 Pro 05-06 | Official | Apr 8, 2026 |

Each row reports the model’s accuracy on MMMU-Pro.