How to run Terminal-Bench 2.1 — benchmark guide

Terminal-Bench evaluates AI agents on real terminal/command-line tasks inside sandboxed Docker containers. Terminal-Bench 2.0 is run with the official Harbor harness (`harbor` CLI), which launches tasks locally via Docker or in the cloud via Daytona/Modal; the original Terminal-Bench-Core v0.1.1 is still runnable with the legacy `tb` CLI from the `terminal-bench` package. Both harnesses are public, locally runnable, and reproducible against your own model. When reporting a score, state alongside it the harness (`harbor` or `tb`), the exact dataset and version (`terminal-bench@2.0` or `terminal-bench-core` 0.1.1), the agent scaffold, the model id, the number of trials, and the execution environment (local Docker or Daytona).

Benchmark: Terminal-Bench 2.1
Repository: github.com/laude-institute/harbor
Dataset: huggingface.co/datasets/harborframework/terminal-bench-2-leaderboard
Metric: task success

1. Install

shell

# Requires Docker installed and running (and uv).
uv tool install harbor
# or: pip install harbor
# Legacy Terminal-Bench-Core v0.1.1 harness (separate package/CLI):
# uv tool install terminal-bench   (or: pip install terminal-bench)
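Before installing or running anything, it can help to confirm that the prerequisites named above are actually on your PATH. The following is a minimal sketch in POSIX sh; `missing_cmds` is a hypothetical helper written for this guide, not part of Harbor.

```shell
# Hypothetical helper: print the names of any commands that cannot be
# resolved on PATH and return non-zero if at least one is missing.
missing_cmds() {
  missing=""
  for cmd in "$@"; do
    # `command -v` succeeds only if the shell can resolve the name.
    if ! command -v "$cmd" >/dev/null 2>&1; then
      missing="${missing:+$missing }$cmd"
    fi
  done
  if [ -n "$missing" ]; then
    printf '%s\n' "$missing"
    return 1
  fi
}

# Example usage (commented out so that sourcing this file has no side effects):
# missing_cmds docker uv harbor || echo "install the tools listed above first"
# docker info >/dev/null 2>&1 || echo "Docker is installed but the daemon is not reachable"
```

The `docker info` line covers the "installed and running" part of the requirement: the command fails when the Docker daemon cannot be reached.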

2. Run evaluation

shell

# Sanity-check the install with the oracle (reference solver) agent:
harbor run -d terminal-bench/terminal-bench-2 -a oracle

shell

# Run YOUR model/agent against Terminal-Bench 2.0 locally (Docker):
export ANTHROPIC_API_KEY=<YOUR-KEY>
harbor run --dataset terminal-bench@2.0 --agent claude-code --model anthropic/claude-opus-4-1 --n-concurrent 4

shell

# Scale out on a cloud provider (Daytona) for high concurrency:
export DAYTONA_API_KEY=<YOUR-KEY>
harbor run --dataset terminal-bench@2.0 --agent claude-code --model anthropic/claude-opus-4-1 --n-concurrent 100 --env daytona
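A high-concurrency cloud run is expensive to start and then abandon because of a missing key, so a small guard in front of the command may be worthwhile. This is a sketch only; `require_env` is a hypothetical helper, and it relies on `printenv`, which is available on common Linux and macOS systems.

```shell
# Hypothetical helper: succeed only if the named variable is exported and non-empty.
require_env() {
  # printenv only sees exported variables, which is what a child process
  # such as the harbor CLI would inherit.
  if [ -z "$(printenv "$1")" ]; then
    echo "missing or empty environment variable: $1" >&2
    return 1
  fi
}

# Example usage (commented out so that sourcing this file starts no run):
# require_env ANTHROPIC_API_KEY && require_env DAYTONA_API_KEY && \
#   harbor run --dataset terminal-bench@2.0 --agent claude-code \
#     --model anthropic/claude-opus-4-1 --n-concurrent 100 --env daytona
```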

shell

# Discover supported agents/options and available datasets:
harbor run --help
harbor datasets list

shell

# LEGACY Terminal-Bench-Core v0.1.1 (different task set, `tb` CLI):
tb run --agent terminus --model anthropic/claude-3-7-latest --dataset-name terminal-bench-core --dataset-version 0.1.1 --n-concurrent 8

3. Expected output

Each task runs in a sandboxed container; the per-task test script determines pass/fail and a per-trial result.json is written (trial directories live under the submission/job folder). The aggregate metric is task success rate (fraction of tasks resolved). For a valid leaderboard run, evaluate each task with a minimum of five trials (the `-k 5` flag is recommended) and keep timeout_multiplier=1.0 (no resource/timeout overrides). Do not mix Terminal-Bench 2.0 (harbor, terminal-bench@2.0) scores with Terminal-Bench-Core v0.1.1 (tb, terminal-bench-core 0.1.1) scores — they are different task sets.
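To make the arithmetic concrete, here is one plausible way to turn per-trial outcomes into a single success rate: average the pass rate of each task over its trials, then average across tasks. The two-column input format is an assumption made for illustration; it is not the schema of Harbor's result.json, and `success_rate` is a hypothetical helper rather than the leaderboard's own scoring code.

```shell
# Hypothetical helper. Reads one line per trial on stdin: "<task_id> <passed>",
# where <passed> is 1 (resolved) or 0 (not resolved).
# Prints the mean over tasks of each task's per-trial pass rate.
success_rate() {
  awk '
    NF >= 2 { trials[$1]++; passes[$1] += $2 }
    END {
      n = 0; total = 0
      for (t in trials) { n++; total += passes[t] / trials[t] }
      if (n == 0) { print "no trials"; exit 1 }
      printf "%.3f\n", total / n
    }
  '
}

# Example usage: task "a" passes 1 of 2 trials, task "b" passes 2 of 2.
# printf 'a 1\na 0\nb 1\nb 1\n' | success_rate
```

When every task has the same number of trials, this value equals the pooled fraction of passing trials.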

4. Submit results

View the public leaderboard at tbench.ai/leaderboard. To submit Terminal-Bench 2.0 results, fork the HuggingFace dataset (harborframework/terminal-bench-2-leaderboard), create a branch, and add your run artifacts under submissions/terminal-bench/2.0/<agent>__<model(s)>/. Include metadata.yaml (agent and model info), a config.json per job, and the per-trial result.json files. The run must satisfy the validity rules: at least 5 trials per task (via -k 5), timeout_multiplier=1.0, and agents must NOT access the Terminal-Bench website or GitHub repo. Then open a PR; a bot validates it and a maintainer merges it.

NOTE: at audit time the dataset card said 'SUBMISSIONS CLOSED' pending a new submission process/integrity policy; check the dataset card for the current status.

Always report harness version, dataset+version, agent scaffold, model id, trial count, and execution env alongside any number. Legacy Terminal-Bench-Core v0.1.1 is submitted separately via the tbench.ai leaderboard guide using the `tb` CLI.
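The submission layout can be sanity-checked locally before opening a PR. The sketch below builds the target folder name and checks that the three kinds of file are present. Both helpers are hypothetical. The check assumes metadata.yaml sits at the top of the submission folder and makes no assumption about how deeply config.json and result.json are nested. How model ids containing a slash should be written in the folder name is not covered here, so follow the naming used by existing submissions.

```shell
# Hypothetical helper: print the submission folder for an agent and model label.
submission_dir() {
  printf 'submissions/terminal-bench/2.0/%s__%s/\n' "$1" "$2"
}

# Hypothetical helper: report which required artifacts are absent from a folder.
check_submission() {
  dir="$1"
  status=0
  [ -f "$dir/metadata.yaml" ] || { echo "missing metadata.yaml"; status=1; }
  # Search at any depth, since the exact nesting of jobs and trials is not assumed.
  [ -n "$(find "$dir" -type f -name config.json 2>/dev/null | head -n 1)" ] \
    || { echo "no config.json found"; status=1; }
  [ -n "$(find "$dir" -type f -name result.json 2>/dev/null | head -n 1)" ] \
    || { echo "no result.json found"; status=1; }
  return "$status"
}

# Example usage (names are placeholders):
# check_submission "$(submission_dir my-agent my-model)"
```

This only checks that the files exist; the bot that validates the PR remains the authority on whether their contents are acceptable.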

Gotchas

Two harnesses/datasets coexist: use `harbor` + `terminal-bench@2.0` (also written `terminal-bench/terminal-bench-2`) for Terminal-Bench 2.0, and the legacy `tb` CLI (pip install terminal-bench) with `--dataset-name terminal-bench-core --dataset-version 0.1.1` only for the older v0.1.1 leaderboard. Do not conflate their scores.

Docker must be installed and running for local execution; for large concurrency (--n-concurrent 100) you need a cloud provider via `--env daytona` (requires DAYTONA_API_KEY) or Modal, not just local Docker.

Run the `oracle` agent first (`harbor run -d terminal-bench/terminal-bench-2 -a oracle`) to verify your install before spending API tokens on a real model.

Leaderboard validity rules are strict: minimum 5 trials per task (use `-k 5`), timeout_multiplier must equal 1.0, no resource/timeout overrides, and the agent must not be allowed to access the Terminal-Bench website or GitHub (anti reward-hacking). As of audit, submissions were marked CLOSED on the dataset card pending a new process.
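The two numeric validity rules above are easy to check mechanically before a long run is launched. A minimal sketch, assuming you pass the trial count and timeout multiplier you intend to use; `check_leaderboard_rules` is a hypothetical helper, and the rule about website and GitHub access cannot be verified this way.

```shell
# Hypothetical helper: check the trial count and timeout multiplier against
# the leaderboard rules (at least 5 trials per task, multiplier equal to 1.0).
check_leaderboard_rules() {
  trials="$1"
  multiplier="$2"
  status=0
  case "$trials" in
    ''|*[!0-9]*) echo "trial count must be a non-negative integer"; return 1 ;;
  esac
  if [ "$trials" -lt 5 ]; then
    echo "need at least 5 trials per task, got $trials"
    status=1
  fi
  # Compare numerically so that 1, 1.0 and 1.00 are all accepted.
  if ! awk -v m="$multiplier" 'BEGIN { exit !(m + 0 == 1) }'; then
    echo "timeout_multiplier must be 1.0, got $multiplier"
    status=1
  fi
  return "$status"
}

# Example usage (commented out so that sourcing this file starts no run):
# check_leaderboard_rules 5 1.0 && \
#   harbor run --dataset terminal-bench@2.0 --agent claude-code \
#     --model anthropic/claude-opus-4-1 --n-concurrent 4 -k 5
```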