evals.report

MMLU-Pro

MMLU-Pro is a more robust and challenging successor to MMLU, with over 12,000 reasoning-focused questions across 14 subjects. It expands the answer choices from four to ten to better discriminate among frontier large language models.
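One consequence of moving from four to ten answer choices is a lower random-guess baseline. The arithmetic can be sketched as follows, assuming uniformly random guessing and that every question offers the stated number of options. The function name `chance_accuracy` is illustrative and not part of any benchmark tooling.

```python
from fractions import Fraction


def chance_accuracy(num_choices: int) -> Fraction:
    """Expected accuracy of uniform random guessing on a single-answer question."""
    if num_choices < 1:
        raise ValueError("num_choices must be at least 1")
    return Fraction(1, num_choices)


# Assumption: four options per question (MMLU) versus ten (MMLU-Pro).
mmlu_baseline = chance_accuracy(4)
mmlu_pro_baseline = chance_accuracy(10)

print(f"4 choices:  {float(mmlu_baseline):.0%}")
print(f"10 choices: {float(mmlu_pro_baseline):.0%}")
```

Under these assumptions the guessing floor drops from 25% to 10%, so a given score leaves less room for lucky guesses.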

Reasoning · accuracy · Higher is better

| Model | Lab | Score | Source model | Status | Date |
|---|---|---|---|---|---|
| Gemini 3.1 Pro Preview | Google DeepMind | 90.99% | | Verified | Feb 19, 2026 |
| Claude Opus 4.7 | Anthropic | 89.87% | | Verified | Apr 16, 2026 |
| Gemini 3 Pro | Google DeepMind | 89.8% | | Verified | Nov 18, 2025 |
| Claude Opus 4.5 | Anthropic | 89.5% | | Verified | Nov 24, 2025 |
| Gemini 3 Flash | Google DeepMind | 89.0% | | Verified | Dec 17, 2025 |
| Claude Opus 4.1 | Anthropic | 88.0% | | Verified | Aug 5, 2025 |
| MiniMax M2.1 | MiniMax | 87.5% | | Verified | Dec 23, 2025 |
| Claude Sonnet 4.5 | Anthropic | 87.5% | | Verified | Sep 29, 2025 |
| Claude Opus 4 | Anthropic | 87.3% | | Verified | May 22, 2025 |
| GPT-5 | OpenAI | 87.1% | | Verified | Aug 7, 2025 |
| GPT-5.1 | OpenAI | 87.0% | | Verified | Nov 12, 2025 |
| Grok 4 | xAI | 86.6% | | Verified | Jul 9, 2025 |
| Gemini 2.5 Pro | Google DeepMind | 86.2% | | Verified | Mar 25, 2025 |
| DeepSeek V3.2 | DeepSeek | 86.2% | | Verified | Dec 1, 2025 |
| GPT-5.2 | OpenAI | 85.9% | | Verified | Dec 11, 2025 |
| GLM-4.7 | Z.ai | 85.6% | | Verified | Dec 22, 2025 |
| Grok 4.1 fast reasoning | xAI | 85.4% | | Verified | Nov 19, 2025 |
| o3 | OpenAI | 85.3% | | Verified | Apr 16, 2025 |
| DeepSeek V3.1 | DeepSeek | 85.1% | | Verified | Aug 21, 2025 |
| DeepSeek R1 | DeepSeek | 84.9% | | Verified | Jan 20, 2025 |
| Kimi K2 Instruct | Moonshot AI | 84.8% | | Verified | Jul 11, 2025 |
| Kimi K2 Thinking | Moonshot AI | 84.6% | | Unverified | Nov 6, 2025 |
| Claude Sonnet 4 | Anthropic | 84.2% | | Verified | May 22, 2025 |
| Qwen3 Max | Alibaba / Qwen | 84.1% | | Verified | Sep 5, 2025 |
| Claude 3.7 Sonnet | Anthropic | 83.7% | | Verified | Feb 24, 2025 |
| GPT-5 mini | OpenAI | 83.7% | | Verified | Aug 7, 2025 |
| Gemini 2.5 Flash | Google DeepMind | 83.2% | | Verified | Apr 17, 2025 |
| o4-mini | OpenAI | 83.2% | | Verified | Apr 16, 2025 |
| GLM-4.6 | Z.ai | 82.9% | | Verified | Sep 30, 2025 |
| Qwen3 235B A22B Instruct 2507 | Alibaba / Qwen | 82.8% | | Verified | Jul 21, 2025 |
| DeepSeek V3 0324 | DeepSeek | 81.9% | | Verified | Mar 24, 2025 |
| Llama 4 Maverick | Meta | 80.9% | | Verified | Apr 5, 2025 |
| GPT-OSS-120B | OpenAI | 80.8% | | Verified | Aug 5, 2025 |
| GPT-4.1 | OpenAI | 80.6% | | Verified | Apr 14, 2025 |
| Claude Haiku 4.5 | Anthropic | 80.0% | | Verified | Oct 15, 2025 |
| Qwen 3 Coder 480B | Alibaba / Qwen | 78.8% | | Verified | Jul 22, 2025 |
| Gemini 2.0 Flash | Google DeepMind | 77.9% | | Verified | Dec 11, 2024 |
| Claude 3.5 Sonnet | Anthropic | 77.2% | | Verified | Jun 20, 2024 |
| DeepSeek V3 | DeepSeek | 75.9% | | Verified | Dec 26, 2024 |
| Llama 4 Scout | Meta | 75.2% | | Verified | Apr 5, 2025 |
| Llama 3.1 405B | Meta | 73.2% | | Verified | Jul 23, 2024 |
| Mistral Large | Mistral AI | 69.7% | | Verified | Feb 26, 2024 |

Each row reports the model’s accuracy on MMLU-Pro.
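Accuracy here means the share of questions answered correctly. A minimal sketch of that calculation, assuming predictions and gold answers are given as option letters A through J; the function name and the sample data are hypothetical, and this is not the site's scoring code.

```python
def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of questions where the predicted option matches the gold option."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


# Hypothetical four-question sample: three of the four predictions match.
preds = ["A", "J", "C", "D"]
gold = ["A", "B", "C", "D"]

print(f"{accuracy(preds, gold):.2%}")  # prints 75.00%
```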