What does task success mean on Terminal-Bench 2.1?

Terminal-Bench 2.1 reports task success (%); higher is better. Scores are shown only within Terminal-Bench 2.1 and are never averaged with other benchmarks.

What is the top reported Terminal-Bench 2.1 score?

GPT-5.6 Sol Ultra has the top reported score on Terminal-Bench 2.1: 91.9% (task success).

Why do Terminal-Bench 2.1 scores differ across runs?

Harness, scaffold, reasoning effort, and prompt setup change results, so two runs of the same model can differ. evals.report keeps each score with its run context so the differences stay visible.

Does evals.report rank models across benchmarks?

No. Terminal-Bench 2.1 scores are shown within their own metric; evals.report never combines benchmarks into a composite ranking or a single "best model".

BenchmarksAgents

Terminal-Bench 2.1

A command-line agent benchmark for completing terminal tasks in reproducible task environments.

Agentstask successHigher is better

Scores About Run this benchmark

Terminal-Bench evaluates AI agents on real terminal/command-line tasks inside sandboxed Docker containers. Terminal-Bench 2.0 is run with the official Harbor harness (`harbor` CLI), which launches tasks locally via Docker or in the cloud via Daytona/Modal; the original Terminal-Bench-Core v0.1.1 is still runnable with the legacy `tb` CLI from the `terminal-bench` package. Both harnesses are public, locally runnable, and reproducible against your own model. When reporting a score, keep attached: harness (harbor vs tb), exact dataset+version (terminal-bench@2.0 / terminal-bench-core 0.1.1), agent scaffold, model id, number of trials, and execution env (local Docker vs daytona).

Benchmark

Terminal-Bench 2.1

Repository

github.com/laude-institute/harbor

Dataset

huggingface.co/datasets/harborframework/terminal-bench-2-leaderboard

Metric

task success

1Install

shell

# Requires Docker installed and running (and uv).
uv tool install harbor

shell

# or: pip install harbor
# Legacy Terminal-Bench-Core v0.1.1 harness (separate package/CLI):
# uv tool install terminal-bench   (or: pip install terminal-bench)

2Run evaluation

shell

# Sanity-check the install with the oracle (reference solver) agent:
harbor run -d terminal-bench/terminal-bench-2 -a oracle

shell

# Run YOUR model/agent against Terminal-Bench 2.0 locally (Docker):
export ANTHROPIC_API_KEY=<YOUR-KEY>

shell

harbor run --dataset terminal-bench@2.0 --agent claude-code --model anthropic/claude-opus-4-1 --n-concurrent 4

shell

# Scale out on a cloud provider (Daytona) for high concurrency:
export DAYTONA_API_KEY=<YOUR-KEY>

shell

harbor run --dataset terminal-bench@2.0 --agent claude-code --model anthropic/claude-opus-4-1 --n-concurrent 100 --env daytona

shell

# Discover supported agents/options and available datasets:
harbor run --help

shell

harbor datasets list

shell

# LEGACY Terminal-Bench-Core v0.1.1 (different task set, `tb` CLI):
tb run --agent terminus --model anthropic/claude-3-7-latest --dataset-name terminal-bench-core --dataset-version 0.1.1 --n-concurrent 8

3Expected output

Each task runs in a sandboxed container; the per-task test script determines pass/fail and a per-trial result.json is written (trial directories live under the submission/job folder). The aggregate metric is task success rate (fraction of tasks resolved). For a valid leaderboard run, evaluate each task with a minimum of five trials (the `-k 5` flag is recommended) and keep timeout_multiplier=1.0 (no resource/timeout overrides). Do not mix Terminal-Bench 2.0 (harbor, terminal-bench@2.0) scores with Terminal-Bench-Core v0.1.1 (tb, terminal-bench-core 0.1.1) scores — they are different task sets.

4Submit results

View the public leaderboard at tbench.ai/leaderboard. To submit Terminal-Bench 2.0 results, fork the HuggingFace dataset (harborframework/terminal-bench-2-leaderboard), create a branch, add run artifacts under submissions/terminal-bench/2.0/<agent>__<model(s)>/ including metadata.yaml (agent+model info), config.json per job, and per-trial result.json files (min 5 trials via -k 5, timeout_multiplier=1.0, agents must NOT access the Terminal-Bench website or GitHub repo), then open a PR — a bot validates and a maintainer merges. NOTE: at audit time the dataset card said 'SUBMISSIONS CLOSED' pending a new submission process/integrity policy (check back per the dataset card for current status). Always report harness version, dataset+version, agent scaffold, model id, trial count, and execution env alongside any number. Legacy Terminal-Bench-Core v0.1.1 is submitted separately via the tbench.ai leaderboard guide using the `tb` CLI.

Gotchas

Two harnesses/datasets coexist: use `harbor` + `terminal-bench@2.0` (also written `terminal-bench/terminal-bench-2`) for Terminal-Bench 2.0, and the legacy `tb` CLI (pip install terminal-bench) with `--dataset-name terminal-bench-core --dataset-version 0.1.1` only for the older v0.1.1 leaderboard. Do not conflate their scores.

Docker must be installed and running for local execution; for large concurrency (--n-concurrent 100) you need a cloud provider via `--env daytona` (requires DAYTONA_API_KEY) or Modal, not just local Docker.

Run the `oracle` agent first (`harbor run -d terminal-bench/terminal-bench-2 -a oracle`) to verify your install before spending API tokens on a real model.

Leaderboard validity rules are strict: minimum 5 trials per task (use `-k 5`), timeout_multiplier must equal 1.0, no resource/timeout overrides, and the agent must not be allowed to access the Terminal-Bench website or GitHub (anti reward-hacking). As of audit, submissions were marked CLOSED on the dataset card pending a new process.