ZeroBench
An intentionally 'impossible' visual reasoning benchmark of 100 hand-crafted main questions (plus 334 subquestions) on which contemporary large multimodal models score near zero, designed to provide maximum headroom for measuring genuine multi-step visual understanding.
Category: Multimodal. Metric: accuracy (higher is better).
| Model | Lab | Score | Source model | Status | Date |
|---|---|---|---|---|---|
| GPT-5.4 | OpenAI | 23.0% (pass@5) | — | Verified | Mar 5, 2026 |
| GPT-5.5 | OpenAI | 22.0% (pass@5) | — | Verified | Apr 23, 2026 |
| Gemini 3 Pro | Google DeepMind | 19.0% (pass@5) | — | Verified | Nov 18, 2025 |
| Gemini 3.1 Pro Preview | Google DeepMind | 19.0% (pass@5) | — | Verified | Feb 19, 2026 |
| GPT-5.2 | OpenAI | 17.0% (pass@5) | — | Verified | Dec 11, 2025 |
| Claude Opus 4.7 | Anthropic | 14.0% (pass@5) | — | Verified | Apr 16, 2026 |
| Gemini 3 Flash | Google DeepMind | 13.0% (pass@5) | — | Verified | Dec 17, 2025 |
| Claude Opus 4.6 | Anthropic | 11.0% (pass@5) | — | Verified | Feb 5, 2026 |
| Claude Opus 4.5 | Anthropic | 10.0% (pass@5) | — | Verified | Nov 24, 2025 |
| GPT-5.1 | OpenAI | 5.0% (pass@5) | — | Verified | Nov 12, 2025 |
| GPT-5 mini | OpenAI | 4.0% (pass@1) | — | Verified | Aug 7, 2025 |
| o3 | OpenAI | 3.0% (pass@1) | — | Verified | Apr 16, 2025 |
| Gemini 2.5 Pro | Google DeepMind | 3.0% (pass@1) | — | Verified | Mar 25, 2025 |
| Gemini 2.5 Flash | Google DeepMind | 3.0% (pass@1) | — | Verified | Apr 17, 2025 |
| o4-mini | OpenAI | 2.0% (pass@1) | — | Verified | Apr 16, 2025 |
| Claude Sonnet 4 | Anthropic | 2.0% (pass@1) | — | Verified | May 22, 2025 |
| GPT-5 | OpenAI | 1.0% (pass@1) | — | Verified | Aug 7, 2025 |
| Claude Opus 4.1 | Anthropic | 1.0% (pass@1) | — | Verified | Aug 5, 2025 |
| Claude Opus 4 | Anthropic | 1.0% (pass@1) | — | Verified | May 22, 2025 |
| Claude 3.7 Sonnet | Anthropic | 1.0% (pass@1) | — | Verified | Feb 24, 2025 |
| Grok 4 | xAI | 1.0% (pass@1) | — | Verified | Jul 9, 2025 |
| GPT-4.1 | OpenAI | 0.0% (pass@1) | — | Verified | Apr 14, 2025 |
| GPT-4o | OpenAI | 0.0% (pass@1) | — | Verified | May 13, 2024 |
| Claude Sonnet 4.5 | Anthropic | 0.0% (pass@1) | — | Verified | Sep 29, 2025 |
| Claude 3.5 Sonnet | Anthropic | 0.0% (pass@1) | — | Verified | Jun 20, 2024 |
| Gemini 1.5 Pro | Google DeepMind | 0.0% (pass@1) | — | Verified | Feb 15, 2024 |
| Llama 4 Maverick | Meta | 0.0% (pass@1) | — | Verified | Apr 5, 2025 |
| Llama 4 Scout | Meta | 0.0% (pass@1) | — | Verified | Apr 5, 2025 |
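Each row reports the model's accuracy on ZeroBench under the pass@k setting shown in parentheses. Because pass@5 allows five attempts per question and pass@1 allows only one, scores reported under different settings are not directly comparable.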
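
The scores in the table are labeled pass@1 or pass@5. As a rough illustration of why the two settings differ, here is a minimal sketch that assumes pass@k counts a question as solved when at least one of its first k attempts is correct. That definition, the function name `pass_at_k`, and the sample grading data are assumptions made for illustration; they are not taken from the leaderboard's own evaluation code.

```python
from typing import Sequence


def pass_at_k(attempts: Sequence[Sequence[bool]], k: int) -> float:
    """Fraction of questions with at least one correct answer in the first k attempts.

    `attempts[i][j]` is True when attempt j on question i was graded correct.
    Assumed definition of pass@k, used here for illustration only.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not attempts:
        raise ValueError("need at least one question")
    solved = 0
    for per_question in attempts:
        if len(per_question) < k:
            raise ValueError("every question needs at least k attempts")
        # A question counts once, no matter how many of its attempts succeed.
        if any(per_question[:k]):
            solved += 1
    return solved / len(attempts)


# Hypothetical grading results: 4 questions, 5 attempts each.
results = [
    [False, False, True, False, False],
    [False, False, False, False, False],
    [True, True, False, True, True],
    [False, False, False, False, True],
]

pass_at_1 = pass_at_k(results, 1)  # only the first attempt counts
pass_at_5 = pass_at_k(results, 5)  # any of the five attempts counts
print(f"pass@1 = {pass_at_1:.2f}, pass@5 = {pass_at_5:.2f}")
```

On the same set of attempts, pass@5 can never be lower than pass@1 under this definition, which is why a pass@5 row and a pass@1 row should be read as different measurements and not ranked against each other.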