Question 1

What is MathVista?

Accepted Answer

A benchmark of 6,141 examples, evaluated on the 1,000-example testmini split, that measures mathematical reasoning in visual contexts. It spans figure QA, geometry, math word problems, textbook QA, and visual QA, and results are reported as answer accuracy.

Question 2

What does accuracy mean on MathVista?

Accepted Answer

MathVista reports accuracy (%); higher is better. Scores are shown only within MathVista and are never averaged with other benchmarks.
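As a rough illustration of the metric, accuracy is the share of examples whose predicted answer matches the reference answer. The sketch below assumes exact matching after trimming whitespace and lowercasing; the actual answer extraction and matching used for MathVista runs is not described here, and the function name `accuracy_percent` is hypothetical.

```python
def accuracy_percent(predictions, references):
    """Return the percentage of predictions that match their reference answer.

    Assumption: answers match if equal after stripping whitespace and
    lowercasing. Real evaluation harnesses may normalize differently.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("need at least one example")
    correct = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return 100.0 * correct / len(references)


# Toy data, not MathVista examples: 3 of the 4 answers match.
print(accuracy_percent(["12", "B", "3.5", "yes"], ["12", "b", "3.5", "no"]))  # 75.0
```

On a 1,000-example split, each example contributes 0.1 percentage points to the score.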

Question 3

What is the top reported MathVista score?

Accepted Answer

o3 has the top reported score on MathVista: 86.8% (accuracy).

Question 4

Why do MathVista scores differ across runs?

Accepted Answer

Harness, scaffold, reasoning effort, and prompt setup change results, so two runs of the same model can differ. evals.report keeps each score with its run context so the differences stay visible.
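One way to picture a score kept with its run context is a record that stores the settings next to the number. This is a minimal sketch, not evals.report's actual schema: the class `RunRecord`, its field names, the helper `runs_for`, the model name `model-x`, and the scores are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """One benchmark score plus the settings it was produced under (hypothetical schema)."""
    model: str
    benchmark: str
    score_percent: float
    harness: str
    scaffold: str
    reasoning_effort: str
    prompt_setup: str


def runs_for(records, model, benchmark):
    """Return every run for a model on one benchmark, without merging or averaging them."""
    return [r for r in records if r.model == model and r.benchmark == benchmark]


# Illustrative values only: the same model under two reasoning-effort settings.
records = [
    RunRecord("model-x", "MathVista", 70.0, "harness-a", "none", "low", "zero-shot"),
    RunRecord("model-x", "MathVista", 74.0, "harness-a", "none", "high", "zero-shot"),
]

for run in runs_for(records, "model-x", "MathVista"):
    print(run.score_percent, run.reasoning_effort)
```

Because the two runs stay separate records, the difference in reasoning effort remains visible alongside the difference in score.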

Question 5

Does evals.report rank models across benchmarks?

Accepted Answer

No. MathVista scores are shown within their own metric; evals.report never combines benchmarks into a composite ranking or a single "best model".

Model	Lab	Score (accuracy)	Source model	Status	Date
o3	OpenAI	86.8%	—	Verified	Apr 16, 2025
o4-mini	OpenAI	84.3%	—	Verified	Apr 16, 2025
Llama 4 Maverick (Open)	Meta	73.7%	—	Verified	Apr 5, 2025
Gemini 2.0 Flash	Google DeepMind	73.1%	—	Unverified	Dec 11, 2024
GPT-4.1	OpenAI	72.2%	—	Verified	Apr 14, 2025
Llama 4 Scout (Open)	Meta	70.7%	—	Verified	Apr 5, 2025
Claude 3.5 Sonnet	Anthropic	67.7%	—	Verified	Jun 20, 2024
Gemini 1.5 Pro	Google DeepMind	63.9%	—	Verified	Feb 15, 2024
GPT-4o	OpenAI	63.8%	—	Verified	May 13, 2024