evals.report
Terminal-Bench 2.1

A command-line agent benchmark for completing terminal tasks in reproducible task environments.

Agents · task success · Higher is better

What this benchmark measures

Rows on this page are sourced from public benchmark artifacts, leaderboard exports, or source-linked model reports. Each row keeps benchmark version, source model name, and available run details attached to the score.

The metric shown here is task success. It should be interpreted within Terminal-Bench 2.1 and the Harbor / Laude Institute source context, not compared as part of a site-wide ranking.

What to be careful about

Each score cell must identify the agent scaffold and harness that produced it, not just the model.
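As a rough illustration of that rule, a score row could be modeled as a record that cannot be built without a scaffold and a harness. This is a minimal sketch, not evals.report's actual schema: `ScoreRow`, its field names, the 0.0 to 1.0 scale for task success, and the example values are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreRow:
    """Hypothetical score row; field names are assumptions, not the site's schema."""

    benchmark: str
    benchmark_version: str
    source_model_name: str
    agent_scaffold: str
    harness: str
    task_success: float  # assumed here to be a fraction of tasks solved, 0.0 to 1.0

    def __post_init__(self):
        # Reject rows that name only the model.
        for field_name in ("agent_scaffold", "harness"):
            if not getattr(self, field_name).strip():
                raise ValueError(f"{field_name} is required for a score cell")
        if not 0.0 <= self.task_success <= 1.0:
            raise ValueError("task_success must be between 0.0 and 1.0")

    def cell_label(self) -> str:
        # The label a reader would see next to the number.
        return f"{self.source_model_name} ({self.agent_scaffold}, {self.harness})"


# Placeholder values only; these are not real results.
row = ScoreRow(
    benchmark="Terminal-Bench",
    benchmark_version="2.1",
    source_model_name="example-model",
    agent_scaffold="example-scaffold",
    harness="example-harness",
    task_success=0.5,
)
print(row.cell_label())
```

Making the scaffold and harness required fields, rather than optional annotations, means a model-only row fails at construction time instead of reaching the page.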

No composite ranking
evals.report never combines benchmarks. Task success on Terminal-Bench 2.1 is its own number; don’t average it with other metrics.
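For anyone aggregating these numbers in their own analysis, the same rule can be enforced in code. The sketch below is one possible guard, not something evals.report publishes: `mean_within_benchmark` and the `(benchmark, metric, value)` tuple layout are hypothetical, and the values are placeholders.

```python
from statistics import mean


def mean_within_benchmark(scores):
    """Average (benchmark, metric, value) tuples only if all share one benchmark and metric."""
    keys = {(benchmark, metric) for benchmark, metric, _ in scores}
    if len(keys) != 1:
        # Mixed benchmarks or metrics (or no scores at all): refuse to produce a number.
        raise ValueError(f"refusing to combine scores across {sorted(keys)}")
    return mean(value for _, _, value in scores)


# Placeholder values only; these are not real results.
same_benchmark = [
    ("Terminal-Bench 2.1", "task success", 0.25),
    ("Terminal-Bench 2.1", "task success", 0.75),
]
print(mean_within_benchmark(same_benchmark))
```

Passing rows from two different benchmarks raises a `ValueError` rather than returning a composite score.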