evals.report

MathVista

A benchmark of 6,141 examples (evaluated on the 1,000-example testmini split) that measures mathematical reasoning in visual contexts, spanning figure QA, geometry, math word problems, textbook QA, and visual QA, reported as answer accuracy.
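As a rough illustration of what "answer accuracy" means here, the sketch below scores predictions against reference answers by exact match after light normalization. It is a simplified assumption, not the benchmark's official scoring pipeline: the function names `normalize` and `answer_accuracy` and the sample answers are hypothetical, and real evaluations may apply a more involved answer-extraction step before comparing.

```python
def normalize(answer: str) -> str:
    """Strip surrounding whitespace and lowercase (an assumed, minimal normalization)."""
    return answer.strip().lower()


def answer_accuracy(predictions: list[str], references: list[str]) -> float:
    """Return the fraction of predictions that match their reference answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(
        normalize(pred) == normalize(ref)
        for pred, ref in zip(predictions, references)
    )
    return correct / len(references)


# Hypothetical toy data: three of the four answers match.
preds = ["B", "3.5", "12", "a"]
refs = ["B", "3.5", "14", "A"]
print(f"{answer_accuracy(preds, refs):.1%}")  # 75.0%
```

Under this definition, if all 1,000 testmini examples are scored, each example contributes 0.1 percentage points, so a score of 86.8% would correspond to 868 correct answers.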

Category: Multimodal. Metric: accuracy (higher is better).
Model              Lab               Score   Status       Date
o3                 OpenAI            86.8%   Verified     Apr 16, 2025
o4-mini            OpenAI            84.3%   Verified     Apr 16, 2025
Llama 4 Maverick   Meta              73.7%   Verified     Apr 5, 2025
Gemini 2.0 Flash   Google DeepMind   73.1%   Unverified   Dec 11, 2024
GPT-4.1            OpenAI            72.2%   Verified     Apr 14, 2025
Llama 4 Scout      Meta              70.7%   Verified     Apr 5, 2025
Claude 3.5 Sonnet  Anthropic         67.7%   Verified     Jun 20, 2024
Gemini 1.5 Pro     Google DeepMind   63.9%   Verified     Feb 15, 2024
GPT-4o             OpenAI            63.8%   Verified     May 13, 2024

Each row reports the model’s accuracy on MathVista.