How to run ARC-AGI-2 — benchmark guide

ARC-AGI-2 ships its public evaluation set (120 tasks) and training set (1000 tasks) as JSON in the arcprize/ARC-AGI-2 repo; that repo has no harness, no scoring script, and no apps/ folder. The official arcprize/arc-agi-benchmarking harness runs your own model (configured in src/arc_agi_benchmarking/models.yml via provider adapters) over those task JSONs and scores the predictions by exact match. Generate predictions with cli/run_all.py, then grade them with src/arc_agi_benchmarking/scoring/scoring.py, pointing --task_dir at the source taskset that holds the solutions. Report every score together with the split (public_eval vs training), the model config name and version, num_attempts (ARC allows 2 trials per test input), and the arc-agi-benchmarking commit. Of the evaluation sets, only the public one is runnable locally; the semi-private and private sets are leaderboard-gated and not in the repo.

Benchmark: ARC-AGI-2

Repository: github.com/arcprize/arc-agi-benchmarking

Dataset: github.com/arcprize/ARC-AGI-2

Metric: accuracy

1. Install

shell

git clone https://github.com/arcprize/arc-agi-benchmarking.git

shell

# from inside the cloned arc-agi-benchmarking directory:
uv sync

shell

cp .env.example .env   # then fill in provider keys, e.g. OPENAI_API_KEY / ANTHROPIC_API_KEY / GEMINI_API_KEY
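Before starting a long run it can help to confirm that the provider keys you need are actually filled in. The sketch below is a minimal check, assuming .env uses plain KEY=value lines (optionally prefixed with export); the helper name env_key_is_set is hypothetical and not part of the harness, and a quoted empty value such as KEY="" would still count as set.

```shell
# Hypothetical helper, not part of arc-agi-benchmarking.
# Succeeds if KEY ($2) has a non-empty value in the env file ($1).
env_key_is_set() {
  grep -Eq "^[[:space:]]*(export[[:space:]]+)?$2=.+" "$1"
}

# Warn about any key from the step above that is still empty or missing.
if [ -f .env ]; then
  for key in OPENAI_API_KEY ANTHROPIC_API_KEY GEMINI_API_KEY; do
    env_key_is_set .env "$key" || echo "warning: $key is empty or missing in .env" >&2
  done
fi
```

You only need the key for the provider your model config uses, so a warning for the others can be ignored.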

shell

git clone https://github.com/arcprize/ARC-AGI-2.git data/arc-agi
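A quick sanity check on the clone is to count the task files per split. This is a sketch under the assumption that each task is one .json file directly inside data/evaluation and data/training; the function names count_tasks and check_split are hypothetical, and the expected counts (120 and 1000) are the ones stated in this guide.

```shell
# Hypothetical helpers, not part of arc-agi-benchmarking.
# Count the .json files directly inside a directory (non-recursive).
count_tasks() {
  find "$1" -maxdepth 1 -type f -name '*.json' | wc -l | tr -d ' '
}

# Compare the number of task files in a directory with an expected count.
check_split() {
  split_dir=$1
  expected=$2
  actual=$(count_tasks "$split_dir")
  if [ "$actual" -eq "$expected" ]; then
    echo "ok: $split_dir has $actual tasks"
  else
    echo "mismatch: $split_dir has $actual tasks, expected $expected" >&2
    return 1
  fi
}

# Run the check only if the dataset has been cloned to the path used above.
if [ -d data/arc-agi/data ]; then
  check_split data/arc-agi/data/evaluation 120
  check_split data/arc-agi/data/training 1000
fi
```

A mismatch usually means the clone landed in a different directory or is incomplete; it is worth fixing before spending API budget on a run.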

2. Run evaluation

shell

# (optional) smoke test the pipeline with the bundled sample tasks and the random baseline:
uv run cli/run_all.py --config random-baseline --data_dir data/sample/tasks --save_submission_dir submissions/random-baseline-sample --log-level INFO

shell

# Real run over the ARC-AGI-2 public evaluation set with your model config (add your model to src/arc_agi_benchmarking/models.yml, then pass its name via --config; set --num_attempts 2 to match the ARC 2-trials protocol):
uv run cli/run_all.py --config <your-model-config> --data_dir data/arc-agi/data/evaluation --save_submission_dir submissions/<your-model-config> --num_attempts 2 --log-level INFO

shell

# Single-task debug run:
uv run main.py --data_dir data/sample/tasks --config random-baseline --task_id 66e6c45b --save_submission_dir submissions/random-single --log-level INFO

3. Score output

shell

uv run src/arc_agi_benchmarking/scoring/scoring.py --task_dir data/arc-agi/data/evaluation --submission_dir submissions/<your-model-config> --results_dir results/<your-model-config>

4. Expected output

cli/run_all.py writes per-task prediction JSONs into the --save_submission_dir (README-recommended layout <save_submission_dir>/<config>/<version>/<eval_type>/, e.g. submissions/gpt-4o-2024-11-20/v1/public_eval/). scoring.py reads those plus the source taskset (which contains the solutions) and writes aggregate results (results.json) into --results_dir, reporting exact-match accuracy: a task counts as solved only when the predicted output grid matches the validated solution exactly in shape, color, and position. This is public-eval-set accuracy; do not compare it against semi-private/private leaderboard numbers.

5. Submit results

There is no automated public submission path for local ARC-AGI-2 runs against your own model: the official leaderboard at arcprize.org/leaderboard is verified on semi-private and private sets that are not in the repo. You can publish self-reported public-eval results with your code (e.g. via the ARC Prize community leaderboard, which expects a link to a public code repo and a public-set score); these scores are not verified by ARC Prize. Always attach the split (public_eval), the model config name and version, num_attempts (ARC permits 2 trials per test input), and the arc-agi-benchmarking commit.
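One way to keep those fields attached to a score is to write them to a small sidecar file next to the results. The sketch below assumes you run it from inside the arc-agi-benchmarking clone so that git resolves the harness commit; the function name write_run_metadata, the file name run_metadata.txt, and the key=value layout are all hypothetical choices, not something the harness produces.

```shell
# Hypothetical helper, not part of arc-agi-benchmarking.
# Usage: write_run_metadata <out_file> <split> <model_config> <version> <num_attempts>
write_run_metadata() {
  out_file=$1
  split=$2
  config=$3
  version=$4
  attempts=$5
  # Falls back to "unknown" when git is unavailable or there is no commit yet.
  commit=$(git rev-parse --verify HEAD 2>/dev/null || echo unknown)
  mkdir -p "$(dirname "$out_file")"
  printf 'split=%s\nmodel_config=%s\nmodel_version=%s\nnum_attempts=%s\nharness_commit=%s\n' \
    "$split" "$config" "$version" "$attempts" "$commit" > "$out_file"
}

# Example call (placeholders as in the scoring step):
# write_run_metadata "results/<your-model-config>/run_metadata.txt" public_eval "<your-model-config>" v1 2
```

Publishing this file alongside results.json lets readers see at a glance that the number is a public-eval score and which harness revision produced it.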

Gotchas

The arcprize/ARC-AGI-2 repo itself has NO harness, scoring script, or apps/ folder; it only ships task JSONs (data/training = 1000 tasks, data/evaluation = 120 tasks) and points to the ARC-AGI-1 browser UI for manual play. Use arcprize/arc-agi-benchmarking for any programmatic evaluation.

Only the 120-task public evaluation set is runnable locally. The semi-private and fully-private sets used for the official leaderboard are NOT in the repo, so a local public-eval score is not directly comparable to leaderboard figures (calibrated to similar difficulty but measured on different withheld tasks).

scoring.py needs --task_dir pointed at the SOURCE taskset that still contains the solution outputs (data/arc-agi/data/evaluation), not at your submission dir, which holds only your predictions. The install step clones ARC-AGI-2 into data/arc-agi, so the eval JSONs live at data/arc-agi/data/evaluation.

ARC scoring is exact-match under the 2-trials rule, so set --num_attempts 2 to match the official protocol. A near-miss grid scores zero; partial or per-pixel correctness is not the headline accuracy.

scoring.py uses --print_logs (underscore) not --print-logs; run_all.py exposes --num_attempts and --retry_attempts. Each run_all invocation handles one model config, so run multiple configs by invoking it repeatedly.
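Since each run_all.py invocation handles one config, comparing several models comes down to a loop. The sketch below reuses the run and score commands from the steps above unchanged; the helper names maybe_run and run_and_score, the DRY_RUN switch, and the config names my-model-a and my-model-b are hypothetical. It defaults to printing the commands rather than executing them, so you can inspect them first.

```shell
# Hypothetical helpers, not part of arc-agi-benchmarking.
# Print the command when DRY_RUN=1 (the default here); otherwise execute it.
maybe_run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Generate predictions for one model config, then score them.
run_and_score() {
  config=$1
  maybe_run uv run cli/run_all.py \
    --config "$config" \
    --data_dir data/arc-agi/data/evaluation \
    --save_submission_dir "submissions/$config" \
    --num_attempts 2 --log-level INFO || return 1
  maybe_run uv run src/arc_agi_benchmarking/scoring/scoring.py \
    --task_dir data/arc-agi/data/evaluation \
    --submission_dir "submissions/$config" \
    --results_dir "results/$config"
}

# Placeholder config names; replace them with entries from your models.yml.
for config in my-model-a my-model-b; do
  run_and_score "$config" || echo "failed: $config" >&2
done
```

Set DRY_RUN=0 in the environment to execute the commands for real. Keeping a separate submissions/ and results/ directory per config avoids one model's predictions being scored under another's name.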