MMLU-Pro
A more robust and challenging successor to MMLU with over 12,000 reasoning-focused questions across 14 subjects, expanding answer choices from four to ten to better discriminate frontier large language models.
Category: Reasoning. Metric: accuracy (higher is better).
| Model | Lab | Score (accuracy) | Source model | Status | Date |
|---|---|---|---|---|---|
| Gemini 3.1 Pro Preview | Google DeepMind | 90.99% | — | Verified | Feb 19, 2026 |
| Claude Opus 4.7 | Anthropic | 89.87% | — | Verified | Apr 16, 2026 |
| Gemini 3 Pro | Google DeepMind | 89.8% | — | Verified | Nov 18, 2025 |
| Claude Opus 4.5 | Anthropic | 89.5% | — | Verified | Nov 24, 2025 |
| Gemini 3 Flash | Google DeepMind | 89.0% | — | Verified | Dec 17, 2025 |
| Claude Opus 4.1 | Anthropic | 88.0% | — | Verified | Aug 5, 2025 |
| MiniMax M2.1 | MiniMax | 87.5% | — | Verified | Dec 23, 2025 |
| Claude Sonnet 4.5 | Anthropic | 87.5% | — | Verified | Sep 29, 2025 |
| Claude Opus 4 | Anthropic | 87.3% | — | Verified | May 22, 2025 |
| GPT-5 | OpenAI | 87.1% | — | Verified | Aug 7, 2025 |
| GPT-5.1 | OpenAI | 87.0% | — | Verified | Nov 12, 2025 |
| Grok 4 | xAI | 86.6% | — | Verified | Jul 9, 2025 |
| Gemini 2.5 Pro | Google DeepMind | 86.2% | — | Verified | Mar 25, 2025 |
| DeepSeek V3.2 | DeepSeek | 86.2% | — | Verified | Dec 1, 2025 |
| GPT-5.2 | OpenAI | 85.9% | — | Verified | Dec 11, 2025 |
| GLM-4.7 | Z.ai | 85.6% | — | Verified | Dec 22, 2025 |
| Grok 4.1 fast reasoning | xAI | 85.4% | — | Verified | Nov 19, 2025 |
| o3 | OpenAI | 85.3% | — | Verified | Apr 16, 2025 |
| DeepSeek V3.1 | DeepSeek | 85.1% | — | Verified | Aug 21, 2025 |
| DeepSeek R1 | DeepSeek | 84.9% | — | Verified | Jan 20, 2025 |
| Kimi K2 Instruct | Moonshot AI | 84.8% | — | Verified | Jul 11, 2025 |
| Kimi K2 Thinking | Moonshot AI | 84.6% | — | Unverified | Nov 6, 2025 |
| Claude Sonnet 4 | Anthropic | 84.2% | — | Verified | May 22, 2025 |
| Qwen3 Max | Alibaba / Qwen | 84.1% | — | Verified | Sep 5, 2025 |
| Claude 3.7 Sonnet | Anthropic | 83.7% | — | Verified | Feb 24, 2025 |
| GPT-5 mini | OpenAI | 83.7% | — | Verified | Aug 7, 2025 |
| Gemini 2.5 Flash | Google DeepMind | 83.2% | — | Verified | Apr 17, 2025 |
| o4-mini | OpenAI | 83.2% | — | Verified | Apr 16, 2025 |
| GLM-4.6 | Z.ai | 82.9% | — | Verified | Sep 30, 2025 |
| Qwen3 235B A22B Instruct 2507 | Alibaba / Qwen | 82.8% | — | Verified | Jul 21, 2025 |
| DeepSeek V3 0324 | DeepSeek | 81.9% | — | Verified | Mar 24, 2025 |
| Llama 4 Maverick | Meta | 80.9% | — | Verified | Apr 5, 2025 |
| GPT-OSS-120B | OpenAI | 80.8% | — | Verified | Aug 5, 2025 |
| GPT-4.1 | OpenAI | 80.6% | — | Verified | Apr 14, 2025 |
| Claude Haiku 4.5 | Anthropic | 80.0% | — | Verified | Oct 15, 2025 |
| Qwen 3 Coder 480B | Alibaba / Qwen | 78.8% | — | Verified | Jul 22, 2025 |
| Gemini 2.0 Flash | Google DeepMind | 77.9% | — | Verified | Dec 11, 2024 |
| Claude 3.5 Sonnet | Anthropic | 77.2% | — | Verified | Jun 20, 2024 |
| DeepSeek V3 | DeepSeek | 75.9% | — | Verified | Dec 26, 2024 |
| Llama 4 Scout | Meta | 75.2% | — | Verified | Apr 5, 2025 |
| Llama 3.1 405B | Meta | 73.2% | — | Verified | Jul 23, 2024 |
| Mistral Large | Mistral AI | 69.7% | — | Verified | Feb 26, 2024 |
Each row reports the model’s accuracy on MMLU-Pro.
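To make the metric concrete, here is a minimal sketch of how an accuracy score of this kind could be computed, and why moving from four to ten answer choices lowers the random-guess floor. The function names (`accuracy`, `chance_baseline`) and the toy predictions are hypothetical, and the sketch assumes every question has exactly ten options, which is a simplification rather than a description of the official evaluation harness.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of questions where the predicted letter matches the gold letter."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def chance_baseline(num_choices: int) -> float:
    """Expected accuracy of uniform random guessing (assumes equal-size option sets)."""
    return 1.0 / num_choices


# Toy data, not real benchmark output: option letters run A..J for ten choices.
preds = ["A", "C", "J", "B"]
gold = ["A", "C", "D", "B"]

print(f"accuracy: {accuracy(preds, gold):.1%}")            # accuracy: 75.0%
print(f"4-choice chance:  {chance_baseline(4):.0%}")       # 4-choice chance:  25%
print(f"10-choice chance: {chance_baseline(10):.0%}")      # 10-choice chance: 10%
```

Under these assumptions, random guessing yields about 10% instead of 25%, so a given score sits further above chance than the same number would on a four-option test.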