evals.report

ARC-AGI-3

The interactive ARC-AGI-3 generalization benchmark: agents must learn novel game environments from scratch (semi-private set).

Category: Reasoning. Metric: accuracy. Higher is better.
Model | Lab | Score | Source model | Status | Date
Claude Opus 4.6 | Anthropic | 0.51% | Anthropic Opus 4.6 (Max) | Official | May 19, 2026
GPT-5.5 high | OpenAI | 0.43% | GPT-5.5 (High) | Official | May 19, 2026
Gemini 3.1 Pro Preview | Google DeepMind | 0.42% | Gemini 3.1 Pro (Preview) | Official | May 19, 2026
GPT-5.4 | OpenAI | 0.21% | GPT-5.4 (High) | Official | May 19, 2026
Claude Opus 4.7 | Anthropic | 0.18% | Opus 4.7 (High) | Official | May 19, 2026
Grok 4.20 beta reasoning | xAI | 0.09% | Grok 4.20 (Beta Reasoning) | Official | May 19, 2026

Each row reports the model’s accuracy on ARC-AGI-3.