evals.report

ARC-AGI-3

Frontier interactive reasoning/generalization benchmark; current models still score near zero.

Structured data · Partial run guide · Public data
Official source Benchmark page

Source detail

Score source

Official ARC Prize leaderboard JSON (v3.json) exposes per-model scores and cost.

Run guide

Dataset/task execution is documented, but frontier submissions are competition-style.

How it can be used

Use the official v3.json leaderboard rows with reported cost preserved.
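As a rough illustration of what "reported cost preserved" could look like in practice, the sketch below parses leaderboard rows and carries the cost value through untouched alongside a note of where the row came from. The schema of v3.json is not described here, so the top-level list layout and the field names `model`, `score`, and `cost` are assumptions, and the sample rows are placeholders rather than real leaderboard results.

```python
import json
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass(frozen=True)
class LeaderboardRow:
    model: str
    score: Optional[float]
    cost: Any    # kept exactly as reported: no rounding, no unit conversion
    source: str  # provenance label, e.g. the file the row was read from


def load_rows(raw_json: str, source: str) -> List[LeaderboardRow]:
    """Parse leaderboard rows, preserving the reported cost verbatim.

    Assumes (hypothetically) that the JSON is a list of objects with
    "model", "score", and "cost" keys; adjust to the real schema.
    """
    records = json.loads(raw_json)
    rows = []
    for rec in records:
        rows.append(
            LeaderboardRow(
                model=rec["model"],
                score=rec.get("score"),
                cost=rec.get("cost"),  # a missing cost stays None, not 0
                source=source,
            )
        )
    return rows


# Placeholder rows for illustration only; these are not real results.
SAMPLE = """
[
  {"model": "example-model-a", "score": 0.0, "cost": 12.5},
  {"model": "example-model-b", "score": 0.01, "cost": null}
]
"""

rows = load_rows(SAMPLE, source="v3.json")
for row in rows:
    print(row.model, row.score, row.cost, row.source)
```

Keeping a missing cost as `None` rather than defaulting it to zero avoids making an unpriced submission look free when rows are later compared.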

Caveat

Because results come from competition submissions and involve private/evaluation splits, record the provenance of each score.
