evals.report

Terminal-Bench 2.0


Agents | task success | Higher is better

What this benchmark measures

An agentic benchmark that measures whether an AI model can complete real command-line (terminal) software tasks end to end. This page covers version 2.0, the 89-task set, scored by task success rate. It is distinct from the newer Terminal-Bench 2.1, which uses a different task set; most 2026 model cards self-report the 2.0 version.

Rows on this page are sourced from public benchmark artifacts, leaderboard exports, or source-linked model reports. Each row keeps benchmark version, source model name, and available run details attached to the score.
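As a rough illustration of what keeping version, source model name, and run details "attached to the score" could look like, here is a minimal sketch. The `ScoreRow` class, its field names, and the sample values are hypothetical and are not evals.report's actual schema or data.

```python
from dataclasses import dataclass, field


@dataclass
class ScoreRow:
    """One leaderboard row (hypothetical schema, for illustration only)."""
    benchmark: str
    version: str
    source_model_name: str   # name exactly as the source reports it
    task_success: float      # fraction of tasks solved, in [0, 1]
    run_details: dict = field(default_factory=dict)


def rows_for_version(rows, benchmark, version):
    """Keep only rows for one benchmark version, so 2.0 and 2.1 never mix."""
    return [r for r in rows if r.benchmark == benchmark and r.version == version]


# Made-up example rows; the model name and scores are placeholders.
rows = [
    ScoreRow("Terminal-Bench", "2.0", "example-model-a", 0.50, {"source": "model card"}),
    ScoreRow("Terminal-Bench", "2.1", "example-model-a", 0.40),
]
tb20 = rows_for_version(rows, "Terminal-Bench", "2.0")
```

Carrying the version on every row means a filter like `rows_for_version` can separate the 2.0 and 2.1 task sets before any comparison is made.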

The metric shown here is task success. It should be interpreted only within Terminal-Bench 2.0 and not read as part of a site-wide ranking.
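The arithmetic behind a task success rate can be sketched as follows. This assumes one pass/fail outcome per task across the 89-task set; how a given run aggregates repeated trials is not stated here, and the function name and sample outcomes are illustrative only.

```python
TOTAL_TASKS = 89  # size of the Terminal-Bench 2.0 task set, per this page


def task_success_rate(outcomes):
    """Return the fraction of tasks solved.

    `outcomes` maps a task id to True (solved) or False (not solved).
    Assumes exactly one outcome per task; that is an assumption of this
    sketch, not a documented rule of the benchmark.
    """
    if len(outcomes) != TOTAL_TASKS:
        raise ValueError(f"expected {TOTAL_TASKS} task outcomes, got {len(outcomes)}")
    return sum(outcomes.values()) / len(outcomes)


# Made-up outcomes: the first 45 tasks pass, the remaining 44 fail.
outcomes = {f"task-{i:02d}": i < 45 for i in range(TOTAL_TASKS)}
rate = task_success_rate(outcomes)
print(f"{rate:.1%}")  # 45 / 89, printed as 50.6%
```

Because the denominator is fixed at 89, one additional solved task moves the score by roughly 1.1 percentage points, which is worth keeping in mind when reading small gaps between rows.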

No composite ranking
evals.report never combines benchmarks. Task success on Terminal-Bench 2.0 is its own number; don't average it with other metrics.