evals.report

ARC-AGI-1

Category: Reasoning. Metric: accuracy (higher is better).

What is ARC-AGI-1?

ARC-AGI-1 is the original ARC-AGI abstract-reasoning puzzle benchmark, reported here on its semi-private set. Its tasks are few-shot grid transformations that are easy for humans but resist memorization. 2026 frontier reasoning models have largely cleared it, which is what motivated the harder ARC-AGI-2. evals.report tracks reported ARC-AGI-1 scores with the model, source, status, date, and run caveats attached, covering official leaderboard scores, vendor-reported launches, and clearly labeled community runs.
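To make "few-shot grid transformations" and the accuracy metric concrete, here is a minimal Python sketch. It assumes the task layout used by the public ARC data files (a `train` list of demonstration pairs and a `test` list, each pair holding an `input` and an `output` grid of small integers). The toy task, the `mirror_rows` rule, and the helper names are invented for illustration and are not taken from the benchmark or from evals.report. The official scoring protocol may differ in its details, for example in how many attempts a model is allowed per test grid.

```python
from typing import Callable, Dict, List

Grid = List[List[int]]
Task = Dict[str, List[Dict[str, Grid]]]

# Hypothetical toy task: the hidden rule is "mirror every row left to right".
toy_task: Task = {
    "train": [
        {"input": [[1, 0, 0], [0, 2, 0]], "output": [[0, 0, 1], [0, 2, 0]]},
        {"input": [[3, 3, 0]], "output": [[0, 3, 3]]},
    ],
    "test": [
        {"input": [[0, 4], [5, 0]], "output": [[4, 0], [0, 5]]},
    ],
}


def mirror_rows(grid: Grid) -> Grid:
    """Candidate rule: reverse each row of the grid."""
    return [list(reversed(row)) for row in grid]


def identity(grid: Grid) -> Grid:
    """Baseline rule: return the grid unchanged."""
    return grid


def fits_demonstrations(solver: Callable[[Grid], Grid], task: Task) -> bool:
    """True if the solver reproduces every demonstration output."""
    return all(solver(pair["input"]) == pair["output"] for pair in task["train"])


def task_solved(solver: Callable[[Grid], Grid], task: Task) -> bool:
    """Exact match on every test grid; partial credit is not modeled here."""
    return all(solver(pair["input"]) == pair["output"] for pair in task["test"])


def accuracy(solver: Callable[[Grid], Grid], tasks: List[Task]) -> float:
    """Percentage of tasks solved, in the same form as the Score column."""
    solved = sum(task_solved(solver, task) for task in tasks)
    return 100.0 * solved / len(tasks)


if __name__ == "__main__":
    print(fits_demonstrations(mirror_rows, toy_task))
    print(accuracy(mirror_rows, [toy_task]), accuracy(identity, [toy_task]))
```

In this sketch a score is the share of tasks whose test grids are reproduced exactly, so a rule that fits the demonstrations but fails on the test grid earns nothing for that task.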

Top reported ARC-AGI-1 score: Gemini 3.1 Pro Preview, at 98% accuracy.

Model | Lab | Score | Source model | Status | Date
Gemini 3.1 Pro Preview | Google DeepMind | 98% | Gemini 3.1 Pro (Preview) | Official | Feb 19, 2026
GPT-5.5 Pro | OpenAI | 96.5% | GPT-5.5 Pro (High) | Official | Apr 23, 2026
Gemini 3 Deep Think | Google DeepMind | 96% | Gemini 3 Deep Think (2/26) | Official | Dec 4, 2025
GPT-5.5 | OpenAI | 95% | GPT-5.5 (xHigh) | Official | Apr 23, 2026
GPT-5.4 Pro | OpenAI | 94.5% | GPT-5.4 Pro (xHigh) | Official | Mar 5, 2026
Claude Opus 4.6 | Anthropic | 94% | Claude Opus 4.6 (120K, High) | Official | Feb 5, 2026
GPT-5.4 | OpenAI | 93.67% | GPT-5.4 (xHigh) | Official | Mar 5, 2026
Gemini 3.5 Flash | Google DeepMind | 92.5% | Gemini 3.5 Flash (High) | Official | May 19, 2026
Claude Opus 4.7 | Anthropic | 92% | Claude 4.7 (Max) | Official | Apr 16, 2026
Claude Opus 4.8 | Anthropic | 92% | Claude Opus 4.8 (High) | Official | May 28, 2026
Grok 4.20 beta reasoning | xAI | 89.5% | Grok 4.20 (Reasoning) | Official | Mar 9, 2026
Claude Sonnet 4.6 | Anthropic | 86.5% | Claude Sonnet 4.6 (High) | Official | Feb 17, 2026
GPT-5.2 | OpenAI | 86.17% | GPT-5.2 (xHigh) | Official | Dec 11, 2025
Gemini 3 Flash | Google DeepMind | 84.67% | Gemini 3 Flash Preview (High) | Official | Dec 17, 2025
Claude Opus 4.5 | Anthropic | 80% | Opus 4.5 (Thinking, 64K) | Official | Nov 24, 2025
GLM-5.2 (Open) | Z.ai | 77% | GLM-5.2 | Official | Jun 16, 2026
Gemini 3 Pro | Google DeepMind | 75% | Gemini 3 Pro | Official | Nov 18, 2025
GPT-5.1 | OpenAI | 72.83% | GPT-5.1 (Thinking, High) | Official | Nov 12, 2025
Grok 4 | xAI | 66.67% | Grok 4 (Thinking) | Official | Jul 9, 2025
GPT-5 | OpenAI | 65.67% | GPT-5 (High) | Official | Aug 7, 2025
Kimi K2.5 (Open) | Moonshot AI | 65.33% | Kimi K2.5 | Official | Jan 27, 2026
Claude Sonnet 4.5 | Anthropic | 63.67% | Claude Sonnet 4.5 (Thinking 32K) | Official | Sep 29, 2025
o3 | OpenAI | 60.83% | o3 (High) | Official | Apr 16, 2025
o4-mini | OpenAI | 58.67% | o4-mini (High) | Official | Apr 16, 2025
DeepSeek V3.2 (Open) | DeepSeek | 57% | Deepseek V3.2 | Official | Dec 1, 2025
GPT-5 mini | OpenAI | 54.33% | GPT-5 Mini (High) | Official | Aug 7, 2025
Claude Haiku 4.5 | Anthropic | 47.67% | Claude Haiku 4.5 (Thinking 32K) | Official | Oct 15, 2025
GLM-5 (Open) | Z.ai | 44.67% | GLM-5 | Official | Feb 11, 2026
Claude Sonnet 4 | Anthropic | 40% | Claude Sonnet 4 (Thinking 16K) | Official | May 22, 2025

Each row reports the model’s accuracy on ARC-AGI-1.