Question 1

What is MMLU-Pro?

Accepted Answer

MMLU-Pro is a more robust and challenging successor to MMLU, with over 12,000 reasoning-focused questions across 14 subjects. It expands the answer choices from four to ten to better discriminate among frontier large language models, and it is scored by accuracy.

Question 2

What does accuracy mean on MMLU-Pro?

Accepted Answer

MMLU-Pro reports accuracy (%); higher is better. Scores are shown only within MMLU-Pro and are never averaged with other benchmarks.
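As a rough illustration of the metric, accuracy is the share of questions answered correctly, expressed as a percentage. The sketch below assumes each answer is a single option letter (A through J for ten choices). The function name `accuracy_percent` and the sample data are hypothetical and are not taken from any official MMLU-Pro harness.

```python
from typing import Sequence


def accuracy_percent(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Return the percentage of predictions that exactly match the gold answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot compute accuracy on an empty question set")
    # Count exact matches between each predicted letter and its gold letter.
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


# Made-up example data: four questions, option letters drawn from A-J.
predicted = ["A", "C", "J", "B"]
gold = ["A", "C", "D", "B"]

score = accuracy_percent(predicted, gold)
print(f"{score:.2f}%")  # 3 of 4 correct -> 75.00%
```

Real harnesses also have to extract the chosen letter from free-form model output, which this sketch leaves out.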

Question 3

What is the top reported MMLU-Pro score?

Accepted Answer

Gemini 3.1 Pro Preview has the top reported score on MMLU-Pro: 90.99% (accuracy).

Question 4

Why do MMLU-Pro scores differ across runs?

Accepted Answer

Results depend on the harness, scaffold, reasoning effort, and prompt setup, so two runs of the same model can produce different scores. evals.report keeps each score with its run context so the differences stay visible.
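One way to picture "a score with its run context" is a record that stores the accuracy alongside the settings that produced it. The sketch below is only an assumption about how such a record could look: the `ScoredRun` class, its field names, and the sample values are hypothetical placeholders, not evals.report's actual schema or real results.

```python
from dataclasses import dataclass
from typing import List

# Context fields named in the answer above; the field names are assumptions.
CONTEXT_FIELDS = ("harness", "scaffold", "reasoning_effort", "prompt_setup")


@dataclass(frozen=True)
class ScoredRun:
    """One reported score together with the setup that produced it."""
    model: str
    accuracy_pct: float
    harness: str
    scaffold: str
    reasoning_effort: str
    prompt_setup: str


def differing_fields(a: ScoredRun, b: ScoredRun) -> List[str]:
    """List the context fields on which two runs disagree."""
    return [name for name in CONTEXT_FIELDS if getattr(a, name) != getattr(b, name)]


# Placeholder values for illustration only; these are not real measurements.
run_a = ScoredRun("example-model", 81.2, "harness-a", "none", "high", "5-shot")
run_b = ScoredRun("example-model", 78.9, "harness-b", "none", "low", "0-shot")

diff = differing_fields(run_a, run_b)
print(diff)  # ['harness', 'reasoning_effort', 'prompt_setup']
```

Keeping the context next to the number lets a reader see why two scores for the same model are not directly comparable.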

Question 5

Does evals.report rank models across benchmarks?

Accepted Answer

No. MMLU-Pro scores are shown within their own metric; evals.report never combines benchmarks into a composite ranking or a single "best model".

MMLU-Pro

What this benchmark measures

Frequently asked