Terminal-Bench 2.0
An agentic benchmark measuring whether an AI model can complete real command-line (terminal) software tasks end-to-end. This entry covers version 2.0, the 89-task set, scored by task success rate. It is distinct from the newer Terminal-Bench 2.1, which uses a different task set; most 2026 model cards self-report results on this 2.0 version.
Agents · task success · Higher is better
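As a rough illustration of the metric, task success rate can be read as the fraction of the 89 tasks that the agent completes. The sketch below is not the benchmark's official harness: the function name, the result format (a mapping from task ID to pass/fail), and the choice to count unreported tasks as failures are all assumptions made for the example.

```python
from typing import Mapping

# Size of the Terminal-Bench 2.0 task set, as stated in the description.
TOTAL_TASKS = 89


def task_success_rate(results: Mapping[str, bool],
                      total_tasks: int = TOTAL_TASKS) -> float:
    """Return the fraction of tasks completed successfully.

    `results` maps a (hypothetical) task ID to True if the agent solved it.
    Assumption: tasks absent from `results` count as failures, so the
    denominator is always the full task set rather than the tasks attempted.
    """
    if total_tasks <= 0:
        raise ValueError("total_tasks must be positive")
    if len(results) > total_tasks:
        raise ValueError("more results than tasks in the set")
    passed = sum(1 for solved in results.values() if solved)
    return passed / total_tasks


if __name__ == "__main__":
    # Toy two-task set with made-up task IDs, purely to show the call shape.
    demo = {"task-a": True, "task-b": False}
    print(f"{task_success_rate(demo, total_tasks=2):.1%}")
```

Dividing by the full set size rather than by the number of attempted tasks keeps scores comparable across runs that skip or crash on some tasks; a harness that reports only attempted tasks would need a different denominator.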
No run guide for this benchmark yet.