MathVista
A benchmark of 6,141 examples (evaluated on the 1,000-example testmini split) that measures mathematical reasoning in visual contexts, spanning figure QA, geometry, math word problems, textbook QA, and visual QA, reported as answer accuracy.
Category: Multimodal. Metric: accuracy (higher is better).
| Model | Lab | Score | Source model | Status | Date |
|---|---|---|---|---|---|
| o3 | OpenAI | 86.8% | — | Verified | Apr 16, 2025 |
| o4-mini | OpenAI | 84.3% | — | Verified | Apr 16, 2025 |
| Llama 4 Maverick | Meta | 73.7% | — | Verified | Apr 5, 2025 |
| Gemini 2.0 Flash | Google DeepMind | 73.1% | — | Unverified | Dec 11, 2024 |
| GPT-4.1 | OpenAI | 72.2% | — | Verified | Apr 14, 2025 |
| Llama 4 Scout | Meta | 70.7% | — | Verified | Apr 5, 2025 |
| Claude 3.5 Sonnet | Anthropic | 67.7% | — | Verified | Jun 20, 2024 |
| Gemini 1.5 Pro | Google DeepMind | 63.9% | — | Verified | Feb 15, 2024 |
| GPT-4o | OpenAI | 63.8% | — | Verified | May 13, 2024 |
Each row reports the model’s accuracy on MathVista.
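As a rough illustration of what an answer-accuracy score means on the 1,000-example testmini split, the sketch below computes accuracy as the percentage of predictions that match the reference answer and converts a reported score back into an approximate count of correct answers. The function names `accuracy_percent` and `correct_count` are hypothetical, and the case-insensitive exact match is a simplifying assumption; it is not the benchmark's official scoring procedure, which may extract and normalize answers differently.

```python
def accuracy_percent(predictions, references):
    """Percentage of predictions that match their reference answer.

    Assumption: answers are compared as trimmed, case-insensitive strings.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return 100.0 * correct / len(references)


def correct_count(score_percent, n_examples=1000):
    """Approximate number of correct answers implied by a percentage score."""
    return round(score_percent / 100 * n_examples)


# Toy data: two of three answers match after normalization.
toy_score = accuracy_percent(["A", "3.5", "b"], ["a", "3.5", "c"])

# Scores taken from the table, assuming all 1,000 testmini examples were scored.
top_correct = correct_count(86.8)
gap = correct_count(63.9) - correct_count(63.8)
```

Under these assumptions, a 0.1-point difference between two rows corresponds to a single example out of 1,000, so very small gaps in the table are best read with caution.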