evals.report

Terminal-Bench 2.0

An agentic benchmark that measures whether an AI model can complete real command-line (terminal) software tasks end-to-end. This page covers version 2.0, the 89-task set, scored by task success rate. It is distinct from the newer Terminal-Bench 2.1, which uses a different task set; most 2026 model cards self-report results on version 2.0.

Agents · task success · Higher is better
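
To make the task success metric concrete, here is a minimal sketch of how a success rate over the 89-task set could be computed. It assumes one binary pass/fail outcome per task and an unweighted average; the actual harness may aggregate differently (for example, over several trials per task). The function name, task IDs, and outcomes are all hypothetical and are not taken from Terminal-Bench itself.

```python
def task_success_rate(results: dict[str, bool]) -> float:
    """Return the fraction of tasks marked successful.

    Assumption: each task has exactly one pass/fail outcome and all
    tasks are weighted equally.
    """
    if not results:
        raise ValueError("no task results supplied")
    # True counts as 1 and False as 0 when summed.
    return sum(results.values()) / len(results)


# Hypothetical outcomes for an 89-task run; the task IDs are placeholders.
example = {f"task-{i:02d}": i < 45 for i in range(89)}

rate = task_success_rate(example)
print(f"{rate:.1%}")  # 45 of 89 tasks passed -> 50.6%
```

Because higher is better, a model that solves more of the 89 tasks receives a higher score under this reading of the metric.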

No run guide for this benchmark yet.