evals.report

MMLU-ProX

A multilingual extension of MMLU-Pro spanning 29 typologically diverse languages with 11,829 parallel reasoning-focused multiple-choice questions (10 answer choices) per language, measuring LLM reasoning and knowledge across linguistic and cultural boundaries.

Reasoning · accuracy · Higher is better
Model            Lab        Score   Source model   Status     Date
---------------  ---------  ------  -------------  ---------  ------------
DeepSeek R1      DeepSeek   75.5%   -              Official   Jan 20, 2025
GPT-4.1          OpenAI     72.7%   -              Official   Apr 14, 2025
DeepSeek V3      DeepSeek   70.5%   -              Official   Dec 26, 2024
o4-mini          OpenAI     69.3%   -              Official   Apr 16, 2025
Llama 3.1 405B   Meta       60.1%   -              Verified   Jul 23, 2024

Each row reports the model’s accuracy on MMLU-ProX.
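
As a rough illustration of what an accuracy score on a 10-choice benchmark means, here is a minimal Python sketch. It assumes plain fraction-correct scoring over one set of questions. The page does not say how scores are aggregated across the 29 languages, so that step is left out. The function name `accuracy` and the sample answers are hypothetical, not taken from the benchmark.

```python
from typing import Sequence

# Ten answer letters, matching the 10 answer choices per question.
CHOICES = "ABCDEFGHIJ"

# Expected accuracy of uniform random guessing over 10 choices.
chance_level = 1 / len(CHOICES)


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Return the fraction of predictions that exactly match the gold answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


# Hypothetical toy data: four questions, three answered correctly.
gold = ["A", "C", "J", "B"]
predicted = ["A", "C", "D", "B"]

print(f"accuracy: {accuracy(predicted, gold):.1%}")  # accuracy: 75.0%
print(f"chance level: {chance_level:.1%}")           # chance level: 10.0%
```

Under this assumption, a score such as 75.5% sits well above the 10% expected from random guessing.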