What does accuracy mean on ARC-AGI-2?

ARC-AGI-2 reports accuracy (%); higher is better. Scores are shown only within ARC-AGI-2 and are never averaged with other benchmarks.

What is the top reported ARC-AGI-2 score?

GPT-5.5 has the top reported score on ARC-AGI-2: 85% (accuracy).

Why do ARC-AGI-2 scores differ across runs?

Harness, scaffold, reasoning effort, and prompt setup change results, so two runs of the same model can differ. evals.report keeps each score with its run context so the differences stay visible.

Does evals.report rank models across benchmarks?

No. ARC-AGI-2 scores are shown within their own metric; evals.report never combines benchmarks into a composite ranking or a single "best model".


ARC-AGI-2

The ARC-AGI-2 abstract-reasoning puzzle benchmark (semi-private set), the harder static successor to ARC-AGI-1.

Reasoning · accuracy · Higher is better


ARC-AGI-2 ships its public evaluation set (120 tasks) and training set (1000 tasks) as JSON in the arcprize/ARC-AGI-2 repo, but that repo has no harness, no scoring script, and no apps/ folder. The official arcprize/arc-agi-benchmarking harness runs your own model (configured in src/arc_agi_benchmarking/models.yml via provider adapters) over those task JSONs and scores the predictions by exact match. You generate predictions with cli/run_all.py, then grade them with src/arc_agi_benchmarking/scoring/scoring.py, whose --task_dir points at the source taskset that holds the solutions.

Keep four things attached to any score: the split (public_eval vs training), the model config name and version, num_attempts (ARC allows 2 trials per test input), and the arc-agi-benchmarking commit. They matter because, of the evaluation sets, only the public one is runnable locally; the semi-private and private sets are leaderboard-gated and not in the repo.

Benchmark: ARC-AGI-2
Repository: github.com/arcprize/arc-agi-benchmarking
Dataset: github.com/arcprize/ARC-AGI-2
Metric: accuracy

1. Install

shell

git clone https://github.com/arcprize/arc-agi-benchmarking.git

shell

# from inside the cloned arc-agi-benchmarking directory:
uv sync

shell

cp .env.example .env   # then fill in provider keys, e.g. OPENAI_API_KEY / ANTHROPIC_API_KEY / GEMINI_API_KEY

shell

git clone https://github.com/arcprize/ARC-AGI-2.git data/arc-agi
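Before running anything, it can help to confirm that the dataset clone holds the task counts quoted above (120 evaluation tasks, 1000 training tasks). The following is a minimal Python sketch, not part of the harness; the helper name count_tasks is hypothetical, and it assumes each task file is a JSON object with a "test" list of input/output pairs, which is the standard ARC task layout.

```python
import json
from pathlib import Path


def count_tasks(task_dir: Path) -> tuple[int, int]:
    """Return (number of task files, total number of test inputs) in task_dir.

    Hypothetical helper for sanity-checking a dataset clone; it is not
    provided by arc-agi-benchmarking.
    """
    n_tasks = 0
    n_test_inputs = 0
    for path in sorted(task_dir.glob("*.json")):
        with path.open() as fh:
            task = json.load(fh)
        n_tasks += 1
        # Assumption: each task has a "test" list of {"input", "output"} pairs.
        n_test_inputs += len(task["test"])
    return n_tasks, n_test_inputs
```

Calling count_tasks(Path("data/arc-agi/data/evaluation")) from the harness directory should report 120 task files if the clone matches the counts stated above; a different number suggests the wrong directory or an incomplete clone.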

2. Run evaluation

shell

# (optional) smoke test the pipeline with the bundled sample tasks and the random baseline:
uv run cli/run_all.py --config random-baseline --data_dir data/sample/tasks --save_submission_dir submissions/random-baseline-sample --log-level INFO

shell

# Real run over the ARC-AGI-2 public evaluation set with your model config.
# Add your model to src/arc_agi_benchmarking/models.yml, then pass its name via --config.
# Set --num_attempts 2 to match the ARC 2-trials protocol.
uv run cli/run_all.py --config <your-model-config> --data_dir data/arc-agi/data/evaluation --save_submission_dir submissions/<your-model-config> --num_attempts 2 --log-level INFO

shell

# Single-task debug run:
uv run main.py --data_dir data/sample/tasks --config random-baseline --task_id 66e6c45b --save_submission_dir submissions/random-single --log-level INFO

3. Score output

shell

uv run src/arc_agi_benchmarking/scoring/scoring.py --task_dir data/arc-agi/data/evaluation --submission_dir submissions/<your-model-config> --results_dir results/<your-model-config>

4. Expected output

cli/run_all.py writes per-task prediction JSONs into the directory given by --save_submission_dir (README-recommended layout: <save_submission_dir>/<config>/<version>/<eval_type>/, e.g. submissions/gpt-4o-2024-11-20/v1/public_eval/). scoring.py reads those predictions plus the source taskset (which contains the solutions) and writes aggregate results (results.json) into --results_dir. It reports exact-match accuracy: a task counts as solved only when the predicted output grid matches the validated solution exactly in shape, color, and position. The result is public-eval-set accuracy; do not compare it against semi-private or private leaderboard numbers.
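The exact-match rule can be illustrated in a few lines of Python. This is a sketch of the idea only, not the scoring code in arc-agi-benchmarking: the names grids_match, input_solved, and accuracy_pct are hypothetical, and grids are assumed to be lists of rows of integer color codes, as in the ARC task JSONs. How the harness aggregates tasks that have several test inputs is not covered here.

```python
Grid = list[list[int]]


def grids_match(predicted: Grid, solution: Grid) -> bool:
    """Exact match: same shape and the same color in every cell."""
    # Nested list equality already fails on any difference in row count,
    # row length, or cell value, so a near-miss grid does not match.
    return predicted == solution


def input_solved(attempts: list[Grid], solution: Grid) -> bool:
    """A test input counts as solved if any attempt matches exactly."""
    return any(grids_match(attempt, solution) for attempt in attempts)


def accuracy_pct(solved_flags: list[bool]) -> float:
    """Percentage of solved items; 0.0 for an empty list."""
    if not solved_flags:
        return 0.0
    return 100.0 * sum(solved_flags) / len(solved_flags)
```

With two attempts allowed, a first attempt that is off by one cell and a second that is exactly right still counts as solved, while two near misses score zero.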

5. Submit results

ARC-AGI-2 has no automated public submission for local runs against your own model: the official leaderboard at arcprize.org/leaderboard verifies scores on the semi-private and private sets, which are not in the repo. For public-eval results you can publish your code and scores yourself (e.g. via the ARC Prize community leaderboard, which expects a link to a public code repo and a public-set score); such scores are self-reported and unverified by ARC Prize. Always attach: split (public_eval), model config name + version, num_attempts (ARC permits 2 trials per test input), and the arc-agi-benchmarking commit.
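One way to keep that context attached is to store it next to the score in a small JSON record. The sketch below is an assumption about how you might do this, not a format defined by ARC Prize or the harness; the class RunContext, the function score_record, and every field name are hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunContext:
    """Run context to publish alongside a self-reported score (hypothetical schema)."""

    split: str            # e.g. "public_eval"
    config_name: str      # name of your entry in models.yml
    config_version: str   # your own version label for that config
    num_attempts: int     # ARC permits 2 trials per test input
    harness_commit: str   # arc-agi-benchmarking commit the run used


def score_record(ctx: RunContext, accuracy_pct: float) -> str:
    """Serialize the score together with its run context as JSON."""
    record = asdict(ctx)
    record["accuracy_pct"] = accuracy_pct
    return json.dumps(record, indent=2, sort_keys=True)
```

The commit can be read with git rev-parse HEAD inside the cloned arc-agi-benchmarking directory and passed in as harness_commit.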

Gotchas

The arcprize/ARC-AGI-2 repo itself has NO harness, scoring script, or apps/ folder. It ships only task JSONs (data/training = 1000 tasks, data/evaluation = 120 tasks) and points to the ARC-AGI-1 browser UI for manual play. Use arcprize/arc-agi-benchmarking for any programmatic evaluation.

Only the 120-task public evaluation set is runnable locally. The semi-private and fully-private sets used for the official leaderboard are NOT in the repo, so a local public-eval score is not directly comparable to leaderboard figures (calibrated to similar difficulty but measured on different withheld tasks).

scoring.py needs --task_dir pointed at the SOURCE taskset that still contains the solution outputs (data/arc-agi/data/evaluation), not at your submission dir; the submission dir holds only your predictions. The install step clones ARC-AGI-2 into data/arc-agi, so the eval JSONs live at data/arc-agi/data/evaluation.

ARC scoring is exact-match under the 2-trials rule - set --num_attempts 2 to match the official protocol; a near-miss grid scores zero, so partial/pixel-correctness is not the headline accuracy.

scoring.py uses --print_logs (underscore) not --print-logs; run_all.py exposes --num_attempts and --retry_attempts. Each run_all invocation handles one model config, so run multiple configs by invoking it repeatedly.
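Because each run_all.py invocation handles one model config, a small wrapper can loop over several configs. The following Python sketch builds the same command shown in the run step, using only flags that appear there; the function names and the config names passed to them are hypothetical, and dry_run defaults to True so nothing is executed unless you opt in.

```python
import subprocess


def build_run_all_cmd(
    config: str,
    data_dir: str = "data/arc-agi/data/evaluation",
    num_attempts: int = 2,
) -> list[str]:
    """Build the argv for one cli/run_all.py invocation (one model config)."""
    return [
        "uv", "run", "cli/run_all.py",
        "--config", config,
        "--data_dir", data_dir,
        "--save_submission_dir", f"submissions/{config}",
        "--num_attempts", str(num_attempts),
        "--log-level", "INFO",
    ]


def run_configs(configs: list[str], dry_run: bool = True) -> list[list[str]]:
    """Return one command per config; execute them in order unless dry_run."""
    commands = [build_run_all_cmd(config) for config in configs]
    if not dry_run:
        for cmd in commands:
            # check=True stops the loop at the first failing run.
            subprocess.run(cmd, check=True)
    return commands
```

Run it from inside the cloned arc-agi-benchmarking directory, with each name in configs matching an entry you added to src/arc_agi_benchmarking/models.yml. Each config writes to its own submissions/<config> directory, so scoring can then be run once per config.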