ARC-AGI-3
ARC-AGI-3 is an interactive generalization benchmark in which agents must learn novel game environments from scratch. Results here are for the semi-private set.
Category: Reasoning · Metric: accuracy · Higher is better
| Model | Lab | Score (accuracy) | Source model | Status | Date |
|---|---|---|---|---|---|
| Claude Opus 4.6 | Anthropic | 0.51% | Anthropic Opus 4.6 (Max) | Official | May 19, 2026 |
| GPT-5.5 high | OpenAI | 0.43% | GPT-5.5 (High) | Official | May 19, 2026 |
| Gemini 3.1 Pro Preview | Google DeepMind | 0.42% | Gemini 3.1 Pro (Preview) | Official | May 19, 2026 |
| GPT-5.4 | OpenAI | 0.21% | GPT-5.4 (High) | Official | May 19, 2026 |
| Claude Opus 4.7 | Anthropic | 0.18% | Opus 4.7 (High) | Official | May 19, 2026 |
| Grok 4.20 beta reasoning | xAI | 0.09% | Grok 4.20 (Beta Reasoning) | Official | May 19, 2026 |
Each row reports the model’s accuracy on ARC-AGI-3. Rows are sorted by score, highest first.