evals.report

ARC-AGI-3

The interactive ARC-AGI-3 generalization benchmark: agents must learn novel game environments from scratch (semi-private set).

Reasoning · accuracy · Higher is better

What this benchmark measures

Rows on this page are sourced from public benchmark artifacts, leaderboard exports, or source-linked model reports. Each row keeps benchmark version, source model name, and available run details attached to the score.
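As a rough illustration of what "keeping details attached to the score" could look like, here is a minimal sketch of a row record and a within-benchmark ordering. The `ScoreRow` class, its field names, the `rank_within_benchmark` helper, and the example values are all hypothetical; they are not evals.report's actual schema or data.

```python
from dataclasses import dataclass, field


@dataclass
class ScoreRow:
    """Hypothetical row: a score plus the provenance that travels with it."""
    benchmark: str
    benchmark_version: str
    source_model_name: str
    metric: str
    score: float
    source: str  # e.g. a leaderboard export or a model report
    run_details: dict = field(default_factory=dict)


def rank_within_benchmark(rows, benchmark, metric="accuracy"):
    """Order rows for one benchmark and one metric, highest score first.

    Rows from other benchmarks or metrics are ignored rather than blended in.
    """
    selected = [r for r in rows if r.benchmark == benchmark and r.metric == metric]
    return sorted(selected, key=lambda r: r.score, reverse=True)


# Placeholder rows with made-up values, for illustration only.
rows = [
    ScoreRow("ARC-AGI-3", "v-example", "model-a", "accuracy", 0.10, "leaderboard export"),
    ScoreRow("ARC-AGI-3", "v-example", "model-b", "accuracy", 0.25, "model report"),
    ScoreRow("other-benchmark", "v-example", "model-a", "accuracy", 0.90, "model report"),
]

ranked = rank_within_benchmark(rows, "ARC-AGI-3")
```

Here `ranked` contains only the two ARC-AGI-3 rows, with `model-b` first; the row from the other benchmark never enters the comparison.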

The metric shown here is accuracy. It should be interpreted within ARC-AGI-3, not compared as part of a site-wide ranking.

What to be careful about

Scores may come from competition submissions or from private or evaluation splits, so check each row's provenance before comparing results.

No composite ranking
evals.report never combines benchmarks. Accuracy on ARC-AGI-3 is its own number; don’t average it with other metrics.