How to run ARC-AGI-3 — benchmark guide

ARC-AGI-3 is an interactive reasoning benchmark of novel grid-based games played through the ARC-AGI-3 API. You run it with the official ARC-AGI-3-Agents harness, using uv: clone the repo, copy .env.example to .env, set ARC_API_KEY (plus OPENAI_API_KEY and the agent's MODEL attribute for LLM agents), then run `uv run main.py --agent=<agent> --game=<game>`. The harness opens a scorecard and, on exit, logs a scorecard report (JSON) and an online scorecard URL; per-game progress is tracked via levels_completed/win_levels. Attach the following to any score you report: the harness version (the changelog is at 0.9.3), the exact agent class and MODEL used, OPERATION_MODE, and the game(s) and tags you ran.

Benchmark: ARC-AGI-3

Repository: github.com/arcprize/ARC-AGI-3-Agents

Dataset: three.arcprize.org

Metric: accuracy

1. Install

shell

# Install uv first: https://docs.astral.sh/uv/getting-started/installation/
git clone https://github.com/arcprize/ARC-AGI-3-Agents.git

shell

cd ARC-AGI-3-Agents

shell

cp .env.example .env

2. Run evaluation

shell

# 1) Get an ARC API key from https://three.arcprize.org/ and put it in .env: ARC_API_KEY=your_api_key_here (header is X-API-Key)
# 2) Smoke test with the built-in random agent against the ls20 game:
uv run main.py --agent=random --game=ls20

shell

# 3) To evaluate your own model, set OPENAI_API_KEY in .env and (optionally) edit the MODEL attribute on the LLM agent class in agents/templates/llm_agents.py (default MODEL='gpt-4o-mini'), then run an LLM agent:
uv run main.py --agent=llm --game=ls20

shell

# Other --agent values are the lowercased class names registered in AVAILABLE_AGENTS (agents/__init__.py), e.g. fastllm, reasoningllm, guidedllm, multimodalllm, reasoningagent, openclaw, plus langgraph* and smol* variants.
# Omit --game to run a swarm across ALL available games (many more API calls); --game accepts comma-separated prefixes; use --tags 'experiment,v1.0' to label the scorecard.
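
When scripting several runs, it can help to assemble the command line in one place. The following is a minimal sketch, assuming only the flags named above (`--agent`, `--game`, `--tags`); `build_command` is a hypothetical helper, not part of the harness, and it only builds the argument list without executing anything.

```python
import shlex


def build_command(agent, games=None, tags=None):
    """Assemble the argv list for one harness run (hypothetical helper)."""
    argv = ["uv", "run", "main.py", f"--agent={agent}"]
    if games:
        # --game accepts comma-separated prefixes; leaving it out runs all games.
        argv.append("--game=" + ",".join(games))
    if tags:
        # Tags label the scorecard, e.g. --tags 'experiment,v1.0'.
        argv += ["--tags", ",".join(tags)]
    return argv


cmd = build_command("random", games=["ls20"], tags=["experiment", "v1.0"])
# Print a shell-ready string; pass `cmd` to subprocess.run to actually launch it.
print(shlex.join(cmd))
```

Keeping the arguments as a list avoids shell-quoting mistakes if the command is later passed to `subprocess.run`.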

3. Expected output

The harness prints the API games URL and the game list, then runs the agent against the API. On completion (or during cleanup after SIGINT) it logs an '--- EXISTING SCORECARD REPORT ---' JSON block followed by 'View your scorecard online: {ROOT_URL}/scorecards/{card_id}' (e.g. https://three.arcprize.org/scorecards/<card_id>). Per-game progress is tracked in FrameData via levels_completed and win_levels (renamed from score/win_score in v0.9.3). Logs are also written to logs.log. Report scores per game; do not average across heterogeneous games or mix harness versions.
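
Per-game reporting can be sketched as follows. This assumes win_levels is the number of levels needed to win a game and that you have already extracted the two counters yourself; the exact shape of the scorecard JSON is not described here, so the `results` dictionary, its game ids other than ls20, and all numbers are made-up placeholders.

```python
def level_progress(levels_completed: int, win_levels: int) -> float:
    """Fraction of a game's levels completed, clamped to [0, 1]."""
    if win_levels <= 0:
        return 0.0  # avoid division by zero when no levels are defined
    return min(max(levels_completed, 0), win_levels) / win_levels


# Placeholder values for illustration only, not real results.
results = {
    "ls20": {"levels_completed": 2, "win_levels": 8},
    "demo-game": {"levels_completed": 0, "win_levels": 6},
}

# Report each game on its own line instead of averaging across games.
for game_id, r in results.items():
    frac = level_progress(r["levels_completed"], r["win_levels"])
    print(f"{game_id}: {r['levels_completed']}/{r['win_levels']} levels ({frac:.0%})")
```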

4. Submit results

Scores are recorded on an ARC-AGI-3 scorecard, whose online URL the harness prints. To enter the competition, submit your agent via the official form referenced in the README under 'Contest Submission': https://forms.gle/wMLZrEFGDh33DhzV9. When reporting a number, attach: the harness commit/version (the changelog is at 0.9.3), the exact --agent class and the LLM MODEL used, OPERATION_MODE (from .env), the game_id(s) and --tags, and the scorecard URL.
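
One way to keep that metadata attached to a number is to store it as a small record next to the score. The sketch below is an assumption about how you might organize it; `RunRecord` and its field names are hypothetical and are not produced by the harness, and the values shown are placeholders.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical record of everything needed to interpret one score."""
    harness_version: str          # commit hash or changelog version
    agent: str                    # value passed to --agent
    model: Optional[str]          # MODEL attribute; None for non-LLM agents
    operation_mode: str           # OPERATION_MODE from .env
    game_ids: Tuple[str, ...]
    tags: Tuple[str, ...]
    scorecard_url: str

    def comparable_with(self, other: "RunRecord") -> bool:
        # Same agent class, MODEL, and harness version are required
        # before two scores can be compared.
        return (self.harness_version, self.agent, self.model) == (
            other.harness_version, other.agent, other.model)


run_a = RunRecord("0.9.3", "llm", "gpt-4o-mini", "online", ("ls20",),
                  ("experiment", "v1.0"),
                  "https://three.arcprize.org/scorecards/<card_id>")
run_b = RunRecord("0.9.3", "reasoningllm", "o4-mini", "online", ("ls20",),
                  ("experiment", "v1.0"),
                  "https://three.arcprize.org/scorecards/<card_id>")

print(json.dumps(asdict(run_a), indent=2))
print("comparable:", run_a.comparable_with(run_b))
```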

Gotchas

The selectable --agent values are the lowercased class names from agents/__init__.py (random, llm, fastllm, reasoningllm, guidedllm, multimodalllm, reasoningagent, openclaw, langgraph*, smol*) plus any .recording.jsonl recording files; there is NO generic --model CLI flag. To change the LLM, edit the MODEL attribute of the agent class in agents/templates/ (defaults: LLM and FastLLM 'gpt-4o-mini'; ReasoningLLM 'o4-mini'; GuidedLLM 'o3') and set OPENAI_API_KEY in .env.
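
The relationship between class names, --agent values, and the MODEL attribute can be illustrated with stand-in classes. Nothing below is the harness's real code: `LLM` and `ReasoningLLM` are empty stand-ins carrying only the defaults quoted above, `MyLLM` and its model name are hypothetical, and whether a new subclass is picked up by AVAILABLE_AGENTS automatically is an assumption you should check in agents/__init__.py.

```python
class LLM:
    """Stand-in for the harness's LLM agent (not the real implementation)."""
    MODEL = "gpt-4o-mini"


class ReasoningLLM(LLM):
    """Stand-in carrying the documented default for ReasoningLLM."""
    MODEL = "o4-mini"


class MyLLM(LLM):
    """Hypothetical subclass: changes the model without editing LLM itself."""
    MODEL = "my-model-name"  # placeholder, not a real model id


def agent_names(classes):
    # Mirrors the documented convention: --agent values are lowercased class names.
    return {cls.__name__.lower(): cls for cls in classes}


registry = agent_names([LLM, ReasoningLLM, MyLLM])
for name in sorted(registry):
    print(f"--agent={name} uses MODEL={registry[name].MODEL}")
```

Editing MODEL in place, as described above, is the documented route; a subclass is only a sketch of how to keep the shipped defaults untouched.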

main.py loads .env.example first, then .env with override, so any key set in .env wins. The shipped .env.example sets SCHEME=https, HOST=three.arcprize.org, PORT=443, and OPERATION_MODE=online, so the effective default server is https://three.arcprize.org. The README changelog mentions setting ONLINE_ONLY=True to keep using the online API/Replays, but that key is NOT present in the shipped .env.example (only OPERATION_MODE is); check your harness version before relying on it.
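
The layering can be sketched in plain Python. This is a simplified illustration, not how main.py is implemented: `parse_env`, `effective_settings`, and `server_url` are hypothetical helpers, the parser handles only simple KEY=value lines, and the sample text contains just the keys named above.

```python
DEFAULT_PORTS = {"http": "80", "https": "443"}


def parse_env(text):
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def effective_settings(example_text, env_text):
    settings = parse_env(example_text)     # defaults from .env.example
    settings.update(parse_env(env_text))   # .env overrides them
    return settings


def server_url(settings):
    scheme, host, port = settings["SCHEME"], settings["HOST"], settings["PORT"]
    if DEFAULT_PORTS.get(scheme) == port:
        return f"{scheme}://{host}"        # omit the scheme's default port
    return f"{scheme}://{host}:{port}"


example_text = "SCHEME=https\nHOST=three.arcprize.org\nPORT=443\nOPERATION_MODE=online\n"
env_text = "# local secrets\nARC_API_KEY=your_api_key_here\n"

settings = effective_settings(example_text, env_text)
print(server_url(settings), settings["OPERATION_MODE"])
```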

--game is a prefix filter (e.g. 'ls20' matches game_ids starting with ls20, and it accepts comma-separated prefixes); omitting --game makes an agent swarm play ALL available games, which consumes far more API calls. The LLM agent caps episodes at MAX_ACTIONS=80 by default.
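
The prefix behaviour can be sketched as below. This illustrates the matching rule as described, not the harness's own code; `select_games` is a hypothetical helper and the game ids in `available` are made up.

```python
def select_games(available, game_arg=None):
    """Return the game_ids selected by a --game style argument."""
    if not game_arg:
        # No --game: every available game is played.
        return list(available)
    prefixes = [p.strip() for p in game_arg.split(",") if p.strip()]
    # Keep a game if its id starts with any of the given prefixes.
    return [g for g in available if any(g.startswith(p) for p in prefixes)]


# Made-up ids for illustration only.
available = ["ls20-aaaa", "ls20-bbbb", "demo-cccc"]

print(select_games(available, "ls20"))
print(select_games(available, "ls20-a,demo"))
print(len(select_games(available)), "games when --game is omitted")
```

A short prefix can match more games than intended, so it is worth checking the printed game list before a long run.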

This is a moving benchmark: the changelog shows breaking FrameData field renames across 0.9.1/0.9.2/0.9.3 (e.g. score->levels_completed), so pin the harness version. Scores depend heavily on the agent scaffold, so two runs are only comparable when they use the same agent class, MODEL, and harness version. The project declares requires-python >=3.12 and depends on the 'arc-agi' package (imported as 'arcengine').
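
If you need to read logs written by an older harness version alongside newer ones, a small normalization step can paper over the renames. The sketch below assumes frames are available as plain dictionaries and that win_score maps to win_levels (inferred from the order in which the renames are listed above); `normalize_frame` is a hypothetical helper.

```python
# Old field name -> new field name (second pair is inferred, see lead-in).
RENAMES = {"score": "levels_completed", "win_score": "win_levels"}


def normalize_frame(frame: dict) -> dict:
    """Return a copy of `frame` that uses the post-rename field names."""
    # Start from the renamed legacy fields...
    out = {RENAMES[k]: v for k, v in frame.items() if k in RENAMES}
    # ...then let any field already using a current name take precedence.
    out.update({k: v for k, v in frame.items() if k not in RENAMES})
    return out


old_frame = {"game_id": "ls20", "score": 1, "win_score": 8}
new_frame = {"game_id": "ls20", "levels_completed": 1, "win_levels": 8}

print(normalize_frame(old_frame) == normalize_frame(new_frame))
```

Normalizing field names does not make scores from different harness versions comparable; it only lets the same reporting script read both.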