evals.report

OCRBench v2

A large-scale bilingual (English/Chinese) text-centric benchmark of ~10,000 human-verified QA pairs across 31 scenarios that evaluates large multimodal models on visual text localization, recognition, parsing, and reasoning.

Category: Multimodal. Metric: accuracy. Higher is better.
| Model | Lab | Score | Source model | Status | Date |
| --- | --- | --- | --- | --- | --- |
| Gemini 3 Pro | Google DeepMind | 63.4 | | Verified | Nov 18, 2025 |
| Gemini 2.5 Pro | Google DeepMind | 59.3 | | Verified | Mar 25, 2025 |
| GPT-5 | OpenAI | 55.5 | | Verified | Aug 7, 2025 |
| Gemini 1.5 Pro | Google DeepMind | 51.6 | | Verified | Feb 15, 2024 |
| GPT-5.2 | OpenAI | 50.5 | | Verified | Dec 11, 2025 |
| Claude Opus 4.6 | Anthropic | 48.4 | | Verified | Feb 5, 2026 |
| GPT-4o | OpenAI | 47.6 | | Verified | May 13, 2024 |
| Claude 3.5 Sonnet | Anthropic | 47.5 | | Verified | Jun 20, 2024 |
| Grok 4 | xAI | 45.0 | | Verified | Jul 9, 2025 |
| Claude Sonnet 4 | Anthropic | 42.4 | | Verified | May 22, 2025 |

Each row reports the model’s accuracy on OCRBench v2.