evals.report

ARC-AGI-2

The ARC-AGI-2 abstract-reasoning puzzle benchmark (semi-private set), the harder static successor to ARC-AGI-1.

Category: Reasoning. Metric: accuracy. Higher is better.
| Model | Lab | Score | Source model | Status | Date |
|---|---|---|---|---|---|
| GPT-5.5 | OpenAI | 85% | GPT-5.5 (xHigh) | Official | May 19, 2026 |
| Gemini 3 Deep Think | Google DeepMind | 84.58% | Gemini 3 Deep Think (2/26) | Official | May 19, 2026 |
| GPT-5.5 Pro | OpenAI | 84.58% | GPT-5.5 Pro (High) | Official | May 19, 2026 |
| GPT-5.4 Pro | OpenAI | 83.33% | GPT-5.4 Pro (xHigh) | Official | May 19, 2026 |
| Gemini 3.1 Pro Preview | Google DeepMind | 77.08% | Gemini 3.1 Pro (Preview) | Official | May 19, 2026 |
| Claude Opus 4.7 | Anthropic | 75.83% | Claude 4.7 (Max) | Official | May 19, 2026 |
| GPT-5.4 xHigh | OpenAI | 73.95% | GPT-5.4 (xHigh) | Official | May 19, 2026 |
| Gemini 3.5 Flash | Google DeepMind | 72.08% | Gemini 3.5 Flash (High) | Official | May 19, 2026 |
| Claude Opus 4.6 | Anthropic | 69.17% | Claude Opus 4.6 (120K, High) | Official | May 19, 2026 |
| o3 | OpenAI | 6.53% | o3 (High) | Official | May 19, 2026 |
| o4-mini (high) | OpenAI | 6.11% | o4-mini (High) | Official | May 19, 2026 |
| Claude Sonnet 4 | Anthropic | 5.93% | Claude Sonnet 4 (Thinking 16K) | Official | May 19, 2026 |

Each row reports the model’s accuracy on ARC-AGI-2.