ARC-AGI-1
What is ARC-AGI-1?
ARC-AGI-1 is the original ARC-AGI abstract-reasoning puzzle benchmark (semi-private set): few-shot grid transformations that are easy for humans but resist memorization. Frontier reasoning models had largely cleared it by 2026, and the harder ARC-AGI-2 is its successor. evals.report tracks reported ARC-AGI-1 scores with the model, source, status, date, and run caveats attached, covering official leaderboard scores, vendor-reported launches, and clearly labeled community runs.
Top reported ARC-AGI-1 score: Gemini 3.1 Pro Preview, 98% accuracy.
| Model | Lab | Score | Source model | Status | Date |
|---|---|---|---|---|---|
| Gemini 3.1 Pro Preview | Google DeepMind | 98% | Gemini 3.1 Pro (Preview) | Official | Feb 19, 2026 |
| GPT-5.5 Pro | OpenAI | 96.5% | GPT-5.5 Pro (High) | Official | Apr 23, 2026 |
| Gemini 3 Deep Think | Google DeepMind | 96% | Gemini 3 Deep Think (2/26) | Official | Dec 4, 2025 |
| GPT-5.5 | OpenAI | 95% | GPT-5.5 (xHigh) | Official | Apr 23, 2026 |
| GPT-5.4 Pro | OpenAI | 94.5% | GPT-5.4 Pro (xHigh) | Official | Mar 5, 2026 |
| Claude Opus 4.6 | Anthropic | 94% | Claude Opus 4.6 (120K, High) | Official | Feb 5, 2026 |
| GPT-5.4 | OpenAI | 93.67% | GPT-5.4 (xHigh) | Official | Mar 5, 2026 |
| Gemini 3.5 Flash | Google DeepMind | 92.5% | Gemini 3.5 Flash (High) | Official | May 19, 2026 |
| Claude Opus 4.7 | Anthropic | 92% | Claude 4.7 (Max) | Official | Apr 16, 2026 |
| Claude Opus 4.8 | Anthropic | 92% | Claude Opus 4.8 (High) | Official | May 28, 2026 |
| Grok 4.20 beta reasoning | xAI | 89.5% | Grok 4.20 (Reasoning) | Official | Mar 9, 2026 |
| Claude Sonnet 4.6 | Anthropic | 86.5% | Claude Sonnet 4.6 (High) | Official | Feb 17, 2026 |
| GPT-5.2 | OpenAI | 86.17% | GPT-5.2 (xHigh) | Official | Dec 11, 2025 |
| Gemini 3 Flash | Google DeepMind | 84.67% | Gemini 3 Flash Preview (High) | Official | Dec 17, 2025 |
| Claude Opus 4.5 | Anthropic | 80% | Opus 4.5 (Thinking, 64K) | Official | Nov 24, 2025 |
| GLM-5.2 (Open) | Z.ai | 77% | GLM-5.2 | Official | Jun 16, 2026 |
| Gemini 3 Pro | Google DeepMind | 75% | Gemini 3 Pro | Official | Nov 18, 2025 |
| GPT-5.1 | OpenAI | 72.83% | GPT-5.1 (Thinking, High) | Official | Nov 12, 2025 |
| Grok 4 | xAI | 66.67% | Grok 4 (Thinking) | Official | Jul 9, 2025 |
| GPT-5 | OpenAI | 65.67% | GPT-5 (High) | Official | Aug 7, 2025 |
| Kimi K2.5 (Open) | Moonshot AI | 65.33% | Kimi K2.5 | Official | Jan 27, 2026 |
| Claude Sonnet 4.5 | Anthropic | 63.67% | Claude Sonnet 4.5 (Thinking 32K) | Official | Sep 29, 2025 |
| o3 | OpenAI | 60.83% | o3 (High) | Official | Apr 16, 2025 |
| o4-mini | OpenAI | 58.67% | o4-mini (High) | Official | Apr 16, 2025 |
| DeepSeek V3.2 (Open) | DeepSeek | 57% | Deepseek V3.2 | Official | Dec 1, 2025 |
| GPT-5 mini | OpenAI | 54.33% | GPT-5 Mini (High) | Official | Aug 7, 2025 |
| Claude Haiku 4.5 | Anthropic | 47.67% | Claude Haiku 4.5 (Thinking 32K) | Official | Oct 15, 2025 |
| GLM-5 (Open) | Z.ai | 44.67% | GLM-5 | Official | Feb 11, 2026 |
| Claude Sonnet 4 | Anthropic | 40% | Claude Sonnet 4 (Thinking 16K) | Official | May 22, 2025 |
Each row reports the model’s accuracy on ARC-AGI-1.