ARC-AGI-2
The ARC-AGI-2 abstract-reasoning puzzle benchmark (semi-private set), the harder static successor to ARC-AGI-1.
Reasoning · accuracy · Higher is better
| Model | Lab | Score | Source model | Status | Date |
|---|---|---|---|---|---|
| GPT-5.5 | OpenAI | 85% | GPT-5.5 (xHigh) | Official | May 19, 2026 |
| Gemini 3 Deep Think | Google DeepMind | 84.58% | Gemini 3 Deep Think (2/26) | Official | May 19, 2026 |
| GPT-5.5 Pro | OpenAI | 84.58% | GPT-5.5 Pro (High) | Official | May 19, 2026 |
| GPT-5.4 Pro | OpenAI | 83.33% | GPT-5.4 Pro (xHigh) | Official | May 19, 2026 |
| Gemini 3.1 Pro Preview | Google DeepMind | 77.08% | Gemini 3.1 Pro (Preview) | Official | May 19, 2026 |
| Claude Opus 4.7 | Anthropic | 75.83% | Claude 4.7 (Max) | Official | May 19, 2026 |
| GPT-5.4 xHigh | OpenAI | 73.95% | GPT-5.4 (xHigh) | Official | May 19, 2026 |
| Gemini 3.5 Flash | Google DeepMind | 72.08% | Gemini 3.5 Flash (High) | Official | May 19, 2026 |
| Claude Opus 4.6 | Anthropic | 69.17% | Claude Opus 4.6 (120K, High) | Official | May 19, 2026 |
| o3 | OpenAI | 6.53% | o3 (High) | Official | May 19, 2026 |
| o4-mini (high) | OpenAI | 6.11% | o4-mini (High) | Official | May 19, 2026 |
| Claude Sonnet 4 | Anthropic | 5.93% | Claude Sonnet 4 (Thinking 16K) | Official | May 19, 2026 |
Each row reports the model’s accuracy on ARC-AGI-2.