EnigmaEval
A benchmark of 1,184 puzzle-hunt challenges spanning text and images that probes models' ability to perform implicit knowledge synthesis, lateral thinking, and multi-step deductive reasoning to uncover hidden solution paths.
Category: Reasoning. Metric: accuracy (higher is better).
| Model | Lab | Score | Source model | Status | Date |
|---|---|---|---|---|---|
| GPT-5.4 Pro | OpenAI | 23.82% | — | Verified | Mar 5, 2026 |
| Gemini 3.1 Pro Preview | Google DeepMind | 19.76% | — | Verified | Feb 19, 2026 |
| Gemini 3 Pro | Google DeepMind | 18.24% | — | Verified | Nov 18, 2025 |
| GPT-5.4 | OpenAI | 15.96% | — | Verified | Mar 5, 2026 |
| o3 | OpenAI | 13.09% | — | Verified | Apr 16, 2025 |
| Claude Opus 4.5 | Anthropic | 11.91% | — | Verified | Nov 24, 2025 |
| GPT-5.1 | OpenAI | 11.23% | — | Verified | Nov 12, 2025 |
| GPT-5 | OpenAI | 10.47% | — | Verified | Aug 7, 2025 |
| GPT-5.2 | OpenAI | 10.39% | — | Verified | Dec 11, 2025 |
| o4-mini | OpenAI | 9.21% | — | Verified | Apr 16, 2025 |
| GPT-5 mini | OpenAI | 8.19% | — | Verified | Aug 7, 2025 |
| Claude Opus 4.6 | Anthropic | 7.60% | — | Verified | Feb 5, 2026 |
| Claude Opus 4.1 | Anthropic | 7.18% | — | Verified | Aug 5, 2025 |
| Claude Sonnet 4.5 | Anthropic | 6.00% | — | Verified | Sep 29, 2025 |
| Claude Opus 4 | Anthropic | 5.57% | — | Verified | May 22, 2025 |
| Gemini 2.5 Pro | Google DeepMind | 5.57% | — | Verified | Mar 25, 2025 |
| Claude 3.7 Sonnet | Anthropic | 4.23% | — | Verified | Feb 24, 2025 |
| Kimi K2.5 | Moonshot AI | 3.38% | — | Verified | Jan 27, 2026 |
| Claude Sonnet 4 | Anthropic | 3.12% | — | Verified | May 22, 2025 |
| Gemini 2.5 Flash | Google DeepMind | 2.70% | — | Verified | Apr 17, 2025 |
| GPT-4.1 | OpenAI | 2.17% | — | Verified | Apr 14, 2025 |
| Claude 3.5 Sonnet | Anthropic | 0.91% | — | Verified | Jun 20, 2024 |
| GPT-4o | OpenAI | 0.80% | — | Verified | May 13, 2024 |
| Gemini 2.0 Flash | Google DeepMind | 0.63% | — | Verified | Dec 11, 2024 |
| Llama 4 Maverick | Meta | 0.58% | — | Verified | Apr 5, 2025 |
Each row reports the model’s accuracy on EnigmaEval.
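To get a feel for what these percentages mean in absolute terms, the sketch below converts an accuracy score into an approximate count of solved puzzles. It assumes the score is a plain fraction of all 1,184 puzzles with each puzzle weighted equally; the page does not confirm how the score is aggregated, so treat the counts as rough estimates. The function name `approx_solved` is hypothetical and used only for illustration.

```python
TOTAL_PUZZLES = 1184  # benchmark size stated in the description


def approx_solved(accuracy_percent: float, total: int = TOTAL_PUZZLES) -> int:
    """Estimate how many puzzles an accuracy percentage corresponds to.

    Assumption: accuracy = solved / total, with every puzzle weighted equally.
    """
    return round(accuracy_percent / 100 * total)


# A few scores copied from the table above.
scores = {"GPT-5.4 Pro": 23.82, "o3": 13.09, "GPT-4.1": 2.17}

for model, pct in scores.items():
    print(f"{model}: roughly {approx_solved(pct)} of {TOTAL_PUZZLES} puzzles")
```

Under that assumption, even the top-scoring entry would have solved fewer than 300 of the 1,184 puzzles.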