What does accuracy mean on MMMU-Pro?

MMMU-Pro reports accuracy (%); higher is better. Scores are shown only within MMMU-Pro and are never averaged with other benchmarks.

What is the top reported MMMU-Pro score?

GPT-5.6 Sol has the top reported score on MMMU-Pro: 83.0% (accuracy).

Why do MMMU-Pro scores differ across runs?

Harness, scaffold, reasoning effort, and prompt setup change results, so two runs of the same model can differ. evals.report keeps each score with its run context so the differences stay visible.

Does evals.report rank models across benchmarks?

No. MMMU-Pro scores are shown within their own metric; evals.report never combines benchmarks into a composite ranking or a single "best model".

MMMU-Pro

MMMU-Pro is the harder variant of the MMMU multimodal reasoning benchmark (college-level subject tasks combining text and images), and it is the variant that current frontier models report.

Multimodal | accuracy | Higher is better

MMMU-Pro is run from the mmmu-pro/ subdirectory of the official MMMU-Benchmark/MMMU repo. You run an inference script (infer/infer_*.py) that pulls the MMMU/MMMU_Pro dataset from Hugging Face, queries your model, and writes per-record .jsonl files to ./output; evaluate.py then scans ./output and prints accuracy. Attach the following to any score, since each one changes reported accuracy substantially: the MODE (cot vs direct), the SETTING (standard 10-option vs standard 4-option vs vision), the model/backend (infer_gpt, infer_gemini, infer_lmdeploy, or infer_transformers), and the harness commit.

Benchmark

MMMU-Pro

Repository

github.com/MMMU-Benchmark/MMMU

Dataset

huggingface.co/datasets/MMMU/MMMU_Pro

Metric

accuracy

1. Install

shell

git clone https://github.com/MMMU-Benchmark/MMMU.git

shell

cd MMMU/mmmu-pro

shell

# No requirements.txt ships; install deps inferred from the infer/*.py + evaluate.py imports:
pip install datasets pyyaml pillow tqdm requests openai numpy pandas

shell

# Plus the backend matching the infer script you pick:
# pip install lmdeploy   (for infer/infer_lmdeploy.py)
# pip install transformers torch accelerate   (for infer/infer_transformers.py)

2. Run evaluation

shell

# Run from MMMU/mmmu-pro. README pattern: python infer/infer_xxx.py [MODEL_NAME] [MODE] [SETTING]
# MODE in {cot, direct}; SETTING in {standard (10 options), standard (4 options), vision}
# Quote a SETTING that contains spaces or parentheses, e.g. 'standard (10 options)'
# Official README example (OpenAI-compatible API model). Set API_KEY inside infer_gpt.py first:
python infer/infer_gpt.py gpt-4o cot vision

shell

# lmdeploy backend (same positional-arg interface as infer_gpt):
python infer/infer_lmdeploy.py InternVL2-8B cot vision

shell

# transformers backend uses argparse FLAGS (NOT positional) and must be run from the REPO ROOT (it opens mmmu-pro/prompts.yaml):
# cd ..  &&  python mmmu-pro/infer/infer_transformers.py --model <hf-model-id> --mode cot --dataset_variant vision

3. Score output

shell

# Run from MMMU/mmmu-pro:
python evaluate.py

4. Expected output

infer_gpt.py / infer_lmdeploy.py write one JSON Lines file to ./output named {MODEL}_{SETTING}_{MODE}.jsonl (e.g. gpt-4o_vision_cot.jsonl), each line a dataset record with image_* keys stripped and a 'response' field appended. evaluate.py scans ./output (non-recursive os.listdir), matches files via regex (model)_(standard|vision)_(cot|direct).jsonl, checks each has NUM=1730 records, computes accuracy, prints a line like 'Model: ... Method: ... Setting: ... - Accuracy: NN.NN%', and rewrites processed results back into ./output. Report accuracy only within the same MODE+SETTING you ran; do not average or compare across different settings.
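Before running evaluate.py it can help to check that each file in ./output would be picked up. The following is a minimal sketch built only from the description above, not from evaluate.py itself: precheck_outputs, EXPECTED_RECORDS, and NAME_PATTERN are hypothetical names, and the pattern is an approximation of the filename shape described here. Treat a failed check as a prompt to compare against evaluate.py, not as a verdict.

```python
import re
from pathlib import Path

# Assumptions taken from the prose description, not read from evaluate.py.
EXPECTED_RECORDS = 1730
NAME_PATTERN = re.compile(
    r"^(?P<model>.+)_(?P<setting>standard|vision)_(?P<mode>cot|direct)\.jsonl$"
)


def precheck_outputs(output_dir="./output", pattern=NAME_PATTERN):
    """Return one report dict per top-level .jsonl file in output_dir."""
    reports = []
    for path in sorted(Path(output_dir).iterdir()):
        # Only top-level files are considered, mirroring a non-recursive listing.
        if not path.is_file() or path.suffix != ".jsonl":
            continue
        with path.open(encoding="utf-8") as handle:
            records = sum(1 for line in handle if line.strip())
        reports.append(
            {
                "file": path.name,
                "name_ok": pattern.match(path.name) is not None,
                "records": records,
                "count_ok": records == EXPECTED_RECORDS,
            }
        )
    return reports
```

Calling precheck_outputs() from MMMU/mmmu-pro returns a list you can print or filter for entries where name_ok or count_ok is False.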

5. Submit results

There is no automated submission endpoint in the harness; you self-report accuracy from evaluate.py's printed line. Always attach the run context: harness commit of MMMU-Benchmark/MMMU, the inference script/backend used (infer_gpt / infer_gemini / infer_lmdeploy / infer_transformers), MODE (cot|direct), SETTING (standard (10 options) | standard (4 options) | vision), and the exact model id. For the public leaderboard, follow the instructions on https://mmmu-benchmark.github.io/.
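One way to keep the run context attached to a self-reported number is a small record that travels with it. This is a sketch under stated assumptions: RunContext and its field names are hypothetical and are not a schema defined by the harness or by the leaderboard.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunContext:
    """Hypothetical record; field names are illustrative, not a harness schema."""

    harness_commit: str       # commit of MMMU-Benchmark/MMMU used for the run
    infer_script: str         # e.g. "infer_gpt" or "infer_transformers"
    mode: str                 # "cot" or "direct"
    setting: str              # "standard (10 options)", "standard (4 options)", or "vision"
    model_id: str             # exact model id passed to the inference script
    accuracy_percent: float   # number copied from evaluate.py's printed line

    def comparable_with(self, other: "RunContext") -> bool:
        """Two scores are comparable only when MODE and SETTING both match."""
        return (self.mode, self.setting) == (other.mode, other.setting)

    def to_json(self) -> str:
        """Serialize with stable key order so records diff cleanly."""
        return json.dumps(asdict(self), sort_keys=True)
```

Storing the output of to_json() next to each reported score keeps the context recoverable later, and comparable_with() gives a quick guard against comparing runs from different settings.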

Gotchas

SETTING is passed straight to load_dataset('MMMU/MMMU_Pro', SETTING, split='test') as the HF config name; the three real config names are exactly 'standard (10 options)', 'standard (4 options)', 'vision'. Accuracy differs substantially across these settings and between cot and direct MODE, so never mix them when reporting.
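Because a misspelled SETTING only fails once the dataset is requested, a wrapper can reject it earlier. The sketch below is an assumption-labelled helper, not part of the harness: check_run_args and load_split are hypothetical names, while the load_dataset call itself is the one quoted above.

```python
# Config names and modes as listed in this guide.
VALID_SETTINGS = ("standard (10 options)", "standard (4 options)", "vision")
VALID_MODES = ("cot", "direct")


def check_run_args(mode, setting):
    """Return (mode, setting) unchanged, or raise ValueError on an unknown value."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown MODE {mode!r}; expected one of {VALID_MODES}")
    if setting not in VALID_SETTINGS:
        raise ValueError(f"unknown SETTING {setting!r}; expected one of {VALID_SETTINGS}")
    return mode, setting


def load_split(setting):
    """Load the test split for one setting (needs network access and `datasets`)."""
    check_run_args("cot", setting)  # mode is irrelevant here; only validate the setting
    from datasets import load_dataset  # imported lazily so validation works offline

    return load_dataset("MMMU/MMMU_Pro", setting, split="test")
```

Note that the short form "standard" is rejected here, since it is not one of the three config names.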

infer_transformers.py is NOT positional-arg like the README pattern: it uses argparse flags (--model, --mode, --dataset_variant), must be launched from the repo ROOT (it opens 'mmmu-pro/prompts.yaml'), and writes to ./output/{dataset_name}/{model}_{variant}_{mode}.jsonl (a SUBDIRECTORY, with the full variant string in the filename). evaluate.py uses a non-recursive os.listdir('./output') and a regex capturing only 'standard|vision', so transformers outputs are NOT auto-scored; you must move or rename the file into ./output so that it matches {model}_(standard|vision)_(cot|direct).jsonl.
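The move/rename step can be scripted. The sketch below assumes the nested layout and filename shapes exactly as described in the paragraph above; flat_name and flatten_outputs are hypothetical helpers, not harness functions. It copies instead of moving, and it refuses to overwrite, because both standard variants collapse to the same short name "standard".

```python
import re
import shutil
from pathlib import Path

# Assumed filename shape for transformers outputs, per the description above.
NESTED_PATTERN = re.compile(
    r"^(?P<model>.+)_(?P<variant>standard \((?:10|4) options\)|vision)"
    r"_(?P<mode>cot|direct)\.jsonl$"
)


def flat_name(nested_name):
    """Map a full-variant filename to the short {model}_{setting}_{mode}.jsonl form."""
    match = NESTED_PATTERN.match(nested_name)
    if match is None:
        raise ValueError(f"unrecognized output filename: {nested_name!r}")
    short = "standard" if match["variant"].startswith("standard") else "vision"
    return f"{match['model']}_{short}_{match['mode']}.jsonl"


def flatten_outputs(output_dir="./output", dry_run=True):
    """Plan (and optionally perform) copies from ./output/*/ into ./output."""
    root = Path(output_dir)
    moves = []
    for src in sorted(root.glob("*/*.jsonl")):
        try:
            dst = root / flat_name(src.name)
        except ValueError:
            continue  # leave files with unexpected names untouched
        moves.append((src, dst))
        if not dry_run:
            if dst.exists():
                raise FileExistsError(f"refusing to overwrite {dst}")
            shutil.copy2(src, dst)
    return moves
```

Run flatten_outputs() first to inspect the planned copies, then flatten_outputs(dry_run=False) to apply them. Score the 10-option and 4-option files one at a time, since they map to the same target name.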

Credentials/config are hard-coded placeholders: infer_gpt.py has API_KEY='your_api_key' and base_url defaulting to OpenAI; you must edit the script before running. There is no requirements.txt; deps are inferred from imports (datasets, pyyaml, pillow, tqdm, requests, openai, numpy, pandas, plus lmdeploy or transformers/torch).

Only four infer_*.py scripts exist (infer_gpt, infer_gemini, infer_lmdeploy, infer_transformers); there is no generic 'infer.py'. evaluate.py expects exactly NUM=1730 records per file; otherwise it prints an error and skips the file. In the 10-option setting the option order is shuffled, so <image i> tokens may not appear sequentially; each token still maps to its image_i key (see replace_images_tokens / origin_mmmu_doc_to_visual).