evals.report

CharXiv

A multimodal benchmark of 2,323 real scientific charts from arXiv papers that evaluates chart understanding in multimodal large language models (MLLMs) through descriptive questions and complex reasoning questions. The reasoning split (CharXiv-R) measures accuracy on questions that require synthesizing information across chart elements.

Multimodal · accuracy · Higher is better
Model                  Lab              Score  Source model  Status      Date
Claude Mythos Preview  Anthropic        93.2%  -             Unverified  Apr 7, 2026
Claude Opus 4.7        Anthropic        91.0%  -             Unverified  Apr 16, 2026
Claude Opus 4.8        Anthropic        89.9%  -             Unverified  May 28, 2026
Kimi K2.6              Moonshot AI      86.7%  -             Unverified  Apr 20, 2026
Muse Spark             Meta             86.4%  -             Unverified  Apr 8, 2026
Gemini 3.5 Flash       Google DeepMind  84.2%  -             Unverified  May 19, 2026
GPT-5.5                OpenAI           84.1%  -             Unverified  Apr 23, 2026
GPT-5.2                OpenAI           82.1%  -             Unverified  Dec 11, 2025
Qwen 3.6 Plus          Alibaba / Qwen   81.5%  -             Unverified  Apr 2, 2026
Gemini 3 Pro           Google DeepMind  81.4%  -             Unverified  Nov 18, 2025
GPT-5                  OpenAI           81.1%  -             Unverified  Aug 7, 2025
Gemini 3 Flash         Google DeepMind  80.3%  -             Unverified  Dec 17, 2025
o3                     OpenAI           78.6%  -             Unverified  Apr 16, 2025
Kimi K2.5              Moonshot AI      77.5%  -             Unverified  Jan 27, 2026
Claude Opus 4.6        Anthropic        77.4%  -             Unverified  Feb 5, 2026
o4-mini                OpenAI           72.0%  -             Unverified  Apr 16, 2025
GPT-4o                 OpenAI           58.8%  -             Unverified  May 13, 2024
GPT-4.1                OpenAI           56.7%  -             Unverified  Apr 14, 2025

Each row reports the model’s accuracy on CharXiv.
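
As a rough illustration of what the Score column expresses, the sketch below treats accuracy as the percentage of questions graded correct. The function name `accuracy_percent` and the sample data are hypothetical, and this page does not describe how CharXiv responses are actually graded, so read it as an assumption-laden sketch rather than the benchmark's scoring code.

```python
def accuracy_percent(graded: list[bool]) -> float:
    """Return the share of correctly answered questions as a percentage.

    `graded` holds one boolean per question: True if the answer was
    judged correct, False otherwise (hypothetical input format).
    """
    if not graded:
        raise ValueError("need at least one graded answer")
    # True counts as 1 and False as 0, so sum() gives the number correct.
    return 100.0 * sum(graded) / len(graded)


# Made-up grading results for five questions (not real CharXiv data).
example = [True, True, False, True, True]
print(f"{accuracy_percent(example):.1f}%")  # prints 80.0%
```

Under this reading, a score of 80.0% means four out of every five questions were answered correctly.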