SourcesReasoning
ARC-AGI-2
Widely cited abstraction/generalization benchmark; 2026 frontier models cleared it from ~6% to ~85%.
LaterRaw JSONStructured dataPartial run guidePublic data
Source detail
Score source
Official ARC Prize leaderboard JSON (v2.json) with per-model verified scores and reasoning effort.
Run guide
Tasks and evaluation are public; frontier scores are ARC-Prize-verified.
How it can be used
Use the official v2.json leaderboard rows; reasoning effort is encoded in each model label.
Caveat
Public and semi-private splits differ; keep the reported effort/compute as run context.