evals.report

ZeroBench

An intentionally 'impossible' visual reasoning benchmark of 100 hand-crafted main questions (plus 334 subquestions) on which contemporary large multimodal models score near zero, designed to provide maximum headroom for measuring genuine multi-step visual understanding.

Category: Multimodal. Metric: accuracy (higher is better).
| Model | Lab | Score | Status | Date |
|---|---|---|---|---|
| GPT-5.4 | OpenAI | 23.0% (pass@5) | Verified | Mar 5, 2026 |
| GPT-5.5 | OpenAI | 22.0% (pass@5) | Verified | Apr 23, 2026 |
| Gemini 3 Pro | Google DeepMind | 19.0% (pass@5) | Verified | Nov 18, 2025 |
| Gemini 3.1 Pro Preview | Google DeepMind | 19.0% (pass@5) | Verified | Feb 19, 2026 |
| GPT-5.2 | OpenAI | 17.0% (pass@5) | Verified | Dec 11, 2025 |
| Claude Opus 4.7 | Anthropic | 14.0% (pass@5) | Verified | Apr 16, 2026 |
| Gemini 3 Flash | Google DeepMind | 13.0% (pass@5) | Verified | Dec 17, 2025 |
| Claude Opus 4.6 | Anthropic | 11.0% (pass@5) | Verified | Feb 5, 2026 |
| Claude Opus 4.5 | Anthropic | 10.0% (pass@5) | Verified | Nov 24, 2025 |
| GPT-5.1 | OpenAI | 5.0% (pass@5) | Verified | Nov 12, 2025 |
| GPT-5 mini | OpenAI | 4.0% (pass@1) | Verified | Aug 7, 2025 |
| o3 | OpenAI | 3.0% (pass@1) | Verified | Apr 16, 2025 |
| Gemini 2.5 Pro | Google DeepMind | 3.0% (pass@1) | Verified | Mar 25, 2025 |
| Gemini 2.5 Flash | Google DeepMind | 3.0% (pass@1) | Verified | Apr 17, 2025 |
| o4-mini | OpenAI | 2.0% (pass@1) | Verified | Apr 16, 2025 |
| Claude Sonnet 4 | Anthropic | 2.0% (pass@1) | Verified | May 22, 2025 |
| GPT-5 | OpenAI | 1.0% (pass@1) | Verified | Aug 7, 2025 |
| Claude Opus 4.1 | Anthropic | 1.0% (pass@1) | Verified | Aug 5, 2025 |
| Claude Opus 4 | Anthropic | 1.0% (pass@1) | Verified | May 22, 2025 |
| Claude 3.7 Sonnet | Anthropic | 1.0% (pass@1) | Verified | Feb 24, 2025 |
| Grok 4 | xAI | 1.0% (pass@1) | Verified | Jul 9, 2025 |
| GPT-4.1 | OpenAI | 0.0% (pass@1) | Verified | Apr 14, 2025 |
| GPT-4o | OpenAI | 0.0% (pass@1) | Verified | May 13, 2024 |
| Claude Sonnet 4.5 | Anthropic | 0.0% (pass@1) | Verified | Sep 29, 2025 |
| Claude 3.5 Sonnet | Anthropic | 0.0% (pass@1) | Verified | Jun 20, 2024 |
| Gemini 1.5 Pro | Google DeepMind | 0.0% (pass@1) | Verified | Feb 15, 2024 |
| Llama 4 Maverick | Meta | 0.0% (pass@1) | Verified | Apr 5, 2025 |
| Llama 4 Scout | Meta | 0.0% (pass@1) | Verified | Apr 5, 2025 |

Each row reports the model's accuracy on ZeroBench. Scores are marked as either pass@1 or pass@5, so rows that use different metrics are not directly comparable.
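
The leaderboard mixes pass@1 and pass@5 scores, and the two can differ widely for the same model. The sketch below shows one common way such a score is computed, assuming pass@k means "the fraction of questions for which at least one of the first k sampled attempts is correct". This page does not state how its figures were derived, so that definition, the function name `pass_at_k`, and the sample data are all illustrative assumptions rather than the site's actual method.

```python
from typing import Sequence


def pass_at_k(results: Sequence[Sequence[bool]], k: int) -> float:
    """Return the fraction of questions solved by at least one of the first k attempts.

    `results[i][j]` is True if attempt j on question i was graded correct.
    Assumed definition for illustration; not taken from the leaderboard itself.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not results:
        raise ValueError("results must contain at least one question")

    solved = 0
    for attempts in results:
        if len(attempts) < k:
            raise ValueError("every question needs at least k attempts")
        # A question counts as solved if any of its first k attempts is correct.
        if any(attempts[:k]):
            solved += 1
    return solved / len(results)


# Hypothetical grading outcomes: 4 questions, 5 attempts each.
attempts = [
    [False, False, True, False, False],   # solved on the third attempt
    [False, False, False, False, False],  # never solved
    [True, True, False, True, True],      # solved on the first attempt
    [False, False, False, False, False],  # never solved
]

print(pass_at_k(attempts, 1))  # 0.25: only one question is solved on the first attempt
print(pass_at_k(attempts, 5))  # 0.5: two questions are solved within five attempts
```

Under this definition pass@5 can never be lower than pass@1 for the same set of attempts, which is why a pass@5 row and a pass@1 row should not be ranked against each other without re-running one of them.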