evals.report

Terminal-Bench 2.1

A command-line agent benchmark for completing terminal tasks in reproducible task environments.

Agents | task success | Higher is better

Known official sources: 1

Ready now | HF dataset | Review needed | Run guide ready | Public data

An important command-line agent benchmark with a task registry. Results are adapter-sensitive, so the same model can score differently depending on the agent adapter used.

Category: Agents
Owner: Harbor / Laude Institute
Data path: Use the benchmark page and the Hugging Face (HF) dataset rows, keeping agent name, model, and task-set version as separate fields.
Known caveat: A score cell must show the agent scaffold and harness, not just the model.
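
The data path and caveat above can be sketched as a small record type. This is a minimal illustration, not the benchmark's or the HF dataset's actual schema: the class name `ScoreRow`, its field names, and the helpers `score_cell` and `comparable` are all hypothetical, and the values in the usage note are placeholders rather than real results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreRow:
    """One result row. All field names are assumptions for illustration."""
    agent: str             # agent scaffold that drives the terminal
    model: str             # underlying model, kept separate from the agent
    harness: str           # harness or adapter used to run the tasks
    task_set_version: str  # task-set version the score was measured on
    task_success: float    # fraction of tasks solved, in [0, 1]; higher is better


def score_cell(row: ScoreRow) -> str:
    """Render a score cell that names scaffold and harness, not just the model."""
    return (
        f"{row.task_success:.1%} "
        f"({row.agent} + {row.model}, harness: {row.harness}, "
        f"tasks: {row.task_set_version})"
    )


def comparable(a: ScoreRow, b: ScoreRow) -> bool:
    """Treat two rows as comparable only if harness and task-set version match."""
    return (a.harness, a.task_set_version) == (b.harness, b.task_set_version)
```

For example, `score_cell(ScoreRow("example-agent", "example-model", "example-harness", "2.1", 0.5))` renders a cell that carries all four labels next to the placeholder score. Keeping the fields separate, instead of fusing them into a single "model" string, is what makes a check like `comparable` possible.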