evals.report

ARC-AGI-2

Widely cited abstraction and generalization benchmark; by 2026, frontier models had pushed scores from ~6% to ~85%.

Later · Raw JSON · Structured data · Partial run guide · Public data

Source detail

Score source

Official ARC Prize leaderboard JSON (v2.json) with per-model verified scores and reasoning effort.

Run guide

Tasks and evaluation are public; frontier scores are ARC-Prize-verified.

How it can be used

Use the official v2.json leaderboard rows; reasoning effort is encoded in each model label.
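
As a minimal sketch of that workflow, the snippet below splits a reasoning-effort suffix off each model label. The real v2.json schema is not reproduced here: the field names (`model`, `score`), the "(low|medium|high)" label suffix, and the sample rows are all assumptions made for illustration, so adjust the pattern to whatever the actual labels look like.

```python
import json
import re

# Hypothetical rows. The field names and label format are assumptions,
# not the actual v2.json schema; the scores are made-up placeholders.
SAMPLE = """
[
  {"model": "example-model-a (high)", "score": 0.41},
  {"model": "example-model-a (low)", "score": 0.22},
  {"model": "example-model-b", "score": 0.06}
]
"""

# Assumed convention: effort appears as a trailing "(low|medium|high)" tag.
EFFORT_PATTERN = re.compile(r"\((low|medium|high)\)\s*$", re.IGNORECASE)


def split_label(label):
    """Return (base_model_name, effort); effort is None if no tag is found."""
    match = EFFORT_PATTERN.search(label)
    if match is None:
        return label.strip(), None
    return label[:match.start()].strip(), match.group(1).lower()


def parse_rows(raw):
    """Parse leaderboard JSON text into rows with effort as its own field."""
    rows = []
    for row in json.loads(raw):
        name, effort = split_label(row["model"])
        rows.append({"model": name, "effort": effort, "score": row["score"]})
    return rows


parsed = parse_rows(SAMPLE)
for row in parsed:
    print(row["model"], row["effort"], row["score"])
```

Keeping effort as a separate field, rather than leaving it inside the label, makes it possible to filter or group on it later without re-parsing strings.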

Caveat

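Public and semi-private splits differ, so scores from one are not directly comparable to scores from the other; record the reported reasoning effort and compute alongside each score as run context.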
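
One way to honor this caveat is to group results by split and effort before comparing anything. The sketch below is illustrative only: the `split` and `effort` fields, their values, and the scores are hypothetical placeholders rather than fields confirmed to exist in the leaderboard JSON.

```python
from collections import defaultdict

# Hypothetical records. Field names, split names, and scores are assumptions.
RESULTS = [
    {"model": "example-model-a", "split": "public", "effort": "high", "score": 0.50},
    {"model": "example-model-a", "split": "semi_private", "effort": "high", "score": 0.41},
    {"model": "example-model-b", "split": "semi_private", "effort": "low", "score": 0.06},
]


def group_by_context(results):
    """Bucket (model, score) pairs by (split, effort) so contexts never mix."""
    groups = defaultdict(list)
    for record in results:
        key = (record["split"], record["effort"])
        groups[key].append((record["model"], record["score"]))
    return dict(groups)


def comparable(a, b):
    """Treat two results as comparable only if split and effort both match."""
    return a["split"] == b["split"] and a["effort"] == b["effort"]


grouped = group_by_context(RESULTS)
```

With this shape, a public-split score and a semi-private score for the same model land in different buckets, and `comparable` returns `False` for them.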
