Terminal-Bench 2.0
An agentic benchmark measuring whether an AI model can complete real command-line / terminal software tasks end-to-end (version 2.0, the 89-task set), scored by task success rate. Distinct from the newer Terminal-Bench 2.1 (a different task set); most 2026 model cards self-report this 2.0 version.
What this benchmark measures
An agentic benchmark measuring whether an AI model can complete real command-line / terminal software tasks end-to-end (version 2.0, the 89-task set), scored by task success rate. Distinct from the newer Terminal-Bench 2.1 (a different task set); most 2026 model cards self-report this 2.0 version.
Rows on this page are sourced from public benchmark artifacts, leaderboard exports, or source-linked model reports. Each row keeps benchmark version, source model name, and available run details attached to the score.
The metric shown here is task success. It should be interpreted within Terminal-Bench 2.0, not compared as part of a site-wide ranking.