ARC-AGI-3
Frontier interactive reasoning/generalization benchmark; current models still score near zero.
Later · Raw JSON · Structured data · Partial run guide · Public data
Source detail
Score source
Official ARC Prize leaderboard JSON (v3.json) exposes per-model scores and cost.
Run guide
Dataset and task execution are documented, but frontier results come from competition-style submissions.
How it can be used
Use the official v3.json leaderboard rows with reported cost preserved.
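As a minimal sketch of what "reported cost preserved" could look like in practice, the Python below parses leaderboard rows and keeps the cost field next to the score instead of dropping it. The real schema of v3.json is not described here, so the field names `model`, `score`, and `cost`, the inline sample records, and the `source` label are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional

# Hypothetical stand-in for the contents of v3.json.
# Field names and values are placeholders, not real leaderboard data.
SAMPLE_V3_JSON = """
[
  {"model": "example-model-a", "score": 0.0, "cost": 1.25},
  {"model": "example-model-b", "score": 0.01, "cost": null}
]
"""


@dataclass(frozen=True)
class LeaderboardRow:
    model: str
    score: float
    cost: Optional[float]  # None when the row reports no cost
    source: str            # provenance label carried with every row


def load_rows(raw_json: str, source: str = "arc-prize-v3.json") -> list:
    """Parse leaderboard records, keeping the reported cost as-is."""
    records: list[dict[str, Any]] = json.loads(raw_json)
    rows = []
    for record in records:
        cost = record.get("cost")  # assumed key; a missing cost stays None
        rows.append(
            LeaderboardRow(
                model=str(record["model"]),
                score=float(record["score"]),
                cost=None if cost is None else float(cost),
                source=source,
            )
        )
    return rows


rows = load_rows(SAMPLE_V3_JSON)
for row in rows:
    print(row)
```

Keeping a missing cost as `None` rather than coercing it to zero avoids implying that a run was free, and the `source` field keeps each score tied to where it came from.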
Caveat
Scores come from competition submissions and may rely on private or evaluation splits, so the provenance of each score matters.