OCRBench v2
A large-scale bilingual (English/Chinese) text-centric benchmark of ~10,000 human-verified QA pairs across 31 scenarios that evaluates large multimodal models on visual text localization, recognition, parsing, and reasoning.
Category: Multimodal. Metric: accuracy (higher is better).
| Model | Lab | Score | Source model | Status | Date |
|---|---|---|---|---|---|
| Gemini 3 Pro | Google DeepMind | 63.4 | — | Verified | Nov 18, 2025 |
| Gemini 2.5 Pro | Google DeepMind | 59.3 | — | Verified | Mar 25, 2025 |
| GPT-5 | OpenAI | 55.5 | — | Verified | Aug 7, 2025 |
| Gemini 1.5 Pro | Google DeepMind | 51.6 | — | Verified | Feb 15, 2024 |
| GPT-5.2 | OpenAI | 50.5 | — | Verified | Dec 11, 2025 |
| Claude Opus 4.6 | Anthropic | 48.4 | — | Verified | Feb 5, 2026 |
| GPT-4o | OpenAI | 47.6 | — | Verified | May 13, 2024 |
| Claude 3.5 Sonnet | Anthropic | 47.5 | — | Verified | Jun 20, 2024 |
| Grok 4 | xAI | 45.0 | — | Verified | Jul 9, 2025 |
| Claude Sonnet 4 | Anthropic | 42.4 | — | Verified | May 22, 2025 |
Each row reports the model’s accuracy on OCRBench v2; rows are sorted by score, highest first.