MMMU (Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark)
A benchmark of ~11.5K college-level multimodal questions spanning 30 subjects and 183 subfields across six disciplines, measuring a vision-language model's accuracy at jointly perceiving images (charts, diagrams, maps, tables, etc.) and reasoning with domain knowledge.
Multimodal · accuracy · Higher is better
| Model | Lab | Score (accuracy) | Source model | Status | Date |
|---|---|---|---|---|---|
| GPT-5.1 | OpenAI | 85.4% | — | Unverified | Nov 12, 2025 |
| GPT-5 | OpenAI | 84.2% | — | Verified | Aug 7, 2025 |
| o3 | OpenAI | 82.9% | — | Verified | Apr 16, 2025 |
| Gemini 2.5 Pro | Google DeepMind | 81.7% | — | Verified | Mar 25, 2025 |
| o4-mini | OpenAI | 81.6% | — | Verified | Apr 16, 2025 |
| Claude Opus 4.5 | Anthropic | 80.7% | — | Verified | Nov 24, 2025 |
| Gemini 2.5 Flash | Google DeepMind | 79.7% | — | Unverified | Apr 17, 2025 |
| Claude Opus 4.1 | Anthropic | 77.1% | — | Verified | Aug 5, 2025 |
| Claude Opus 4 | Anthropic | 76.5% | — | Verified | May 22, 2025 |
| Claude 3.7 Sonnet | Anthropic | 75.0% | — | Unverified | Feb 24, 2025 |
| GPT-4.1 | OpenAI | 74.8% | — | Unverified | Apr 14, 2025 |
| Claude Sonnet 4 | Anthropic | 74.4% | — | Unverified | May 22, 2025 |
| Llama 4 Maverick | Meta | 73.4% | — | Unverified | Apr 5, 2025 |
| Claude Haiku 4.5 | Anthropic | 73.2% | — | Verified | Oct 15, 2025 |
| Gemini 2.0 Flash | Google DeepMind | 70.7% | — | Unverified | Dec 11, 2024 |
| Llama 4 Scout | Meta | 69.4% | — | Unverified | Apr 5, 2025 |
| GPT-4o | OpenAI | 69.1% | — | Verified | May 13, 2024 |
| Claude 3.5 Sonnet | Anthropic | 68.3% | — | Unverified | Jun 20, 2024 |
| Gemini 1.5 Pro | Google DeepMind | 65.9% | — | Unverified | Feb 15, 2024 |
Each row reports the model's accuracy on MMMU (Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark).
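As a rough illustration of what an accuracy score means, the sketch below computes the percentage of questions answered correctly. It assumes multiple-choice answers compared by exact match on the answer letter; the function name `accuracy_percent` and the toy data are hypothetical, and a real MMMU evaluation harness may parse and score responses differently (for example, for open-ended questions).

```python
from typing import Sequence


def accuracy_percent(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Return the percentage of predictions that match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot compute accuracy on an empty set")
    # Assumed normalization: ignore surrounding whitespace and letter case.
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return 100.0 * correct / len(answers)


# Toy data for illustration only; not taken from any real run.
preds = ["A", "c", "B", "D"]
gold = ["A", "C", "D", "D"]
print(f"{accuracy_percent(preds, gold):.1f}%")  # prints "75.0%"
```

Three of the four toy predictions match, so the reported score is 75.0%, in the same format as the Score column.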