CharXiv
A multimodal benchmark of 2,323 real scientific charts from arXiv papers that evaluates chart understanding in multimodal large language models (MLLMs) through descriptive questions and complex reasoning questions. The reasoning split (CharXiv-R) measures accuracy on questions that require synthesizing information across chart elements.
Multimodal · accuracy · Higher is better
| Model | Lab | Score | Source model | Status | Date |
|---|---|---|---|---|---|
| Claude Mythos Preview | Anthropic | 93.2% | — | Unverified | Apr 7, 2026 |
| Claude Opus 4.7 | Anthropic | 91.0% | — | Unverified | Apr 16, 2026 |
| Claude Opus 4.8 | Anthropic | 89.9% | — | Unverified | May 28, 2026 |
| Kimi K2.6 | Moonshot AI | 86.7% | — | Unverified | Apr 20, 2026 |
| Muse Spark | Meta | 86.4% | — | Unverified | Apr 8, 2026 |
| Gemini 3.5 Flash | Google DeepMind | 84.2% | — | Unverified | May 19, 2026 |
| GPT-5.5 | OpenAI | 84.1% | — | Unverified | Apr 23, 2026 |
| GPT-5.2 | OpenAI | 82.1% | — | Unverified | Dec 11, 2025 |
| Qwen 3.6 Plus | Alibaba / Qwen | 81.5% | — | Unverified | Apr 2, 2026 |
| Gemini 3 Pro | Google DeepMind | 81.4% | — | Unverified | Nov 18, 2025 |
| GPT-5 | OpenAI | 81.1% | — | Unverified | Aug 7, 2025 |
| Gemini 3 Flash | Google DeepMind | 80.3% | — | Unverified | Dec 17, 2025 |
| o3 | OpenAI | 78.6% | — | Unverified | Apr 16, 2025 |
| Kimi K2.5 | Moonshot AI | 77.5% | — | Unverified | Jan 27, 2026 |
| Claude Opus 4.6 | Anthropic | 77.4% | — | Unverified | Feb 5, 2026 |
| o4-mini | OpenAI | 72.0% | — | Unverified | Apr 16, 2025 |
| GPT-4o | OpenAI | 58.8% | — | Unverified | May 13, 2024 |
| GPT-4.1 | OpenAI | 56.7% | — | Unverified | Apr 14, 2025 |
Each row reports the model’s accuracy on CharXiv.
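As a rough illustration of the metric, an accuracy score can be read as the share of benchmark questions a model answers correctly, expressed as a percentage. The sketch below assumes a hypothetical list of per-question pass/fail grades. The function name `accuracy_percent` and the sample grades are illustrative only, and the actual CharXiv grading procedure is not described here.

```python
def accuracy_percent(grades):
    """Return the percentage of True entries in a list of per-question grades.

    `grades` is a hypothetical list of booleans, one per benchmark question,
    where True means the answer was judged correct.
    """
    if not grades:
        raise ValueError("grades must be non-empty")
    # sum() counts True values because bool is a subclass of int in Python.
    return 100.0 * sum(grades) / len(grades)


# Made-up grades for five questions (not real CharXiv data): 3 of 5 correct.
example_grades = [True, True, False, True, False]
print(f"{accuracy_percent(example_grades):.1f}%")  # prints 60.0%
```

Under this reading, a score such as 93.2% would correspond to roughly 93 correct answers per 100 questions, assuming every question carries equal weight.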