MMLU-ProX
A multilingual extension of MMLU-Pro spanning 29 typologically diverse languages with 11,829 parallel reasoning-focused multiple-choice questions (10 answer choices) per language, measuring LLM reasoning and knowledge across linguistic and cultural boundaries.
Category: Reasoning. Metric: accuracy (higher is better).
| Model | Lab | Score | Source model | Status | Date |
|---|---|---|---|---|---|
| DeepSeek R1 | DeepSeek | 75.5% | — | Official | Jan 20, 2025 |
| GPT-4.1 | OpenAI | 72.7% | — | Official | Apr 14, 2025 |
| DeepSeek V3 | DeepSeek | 70.5% | — | Official | Dec 26, 2024 |
| o4-mini | OpenAI | 69.3% | — | Official | Apr 16, 2025 |
| Llama 3.1 405B | Meta | 60.1% | — | Verified | Jul 23, 2024 |
Each row reports the model’s accuracy on MMLU-ProX.