evals.report

Terminal-Bench 2.1

Important command-line agent benchmark with task registry and adapter-sensitive results.

Ready now · HF dataset · Review needed · Run guide ready · Public data
Official source · Benchmark page

Source detail

Score source

Official leaderboard links to a public HF submissions dataset.

Run guide

Official repo includes tasks, Docker setup, adapters, and registry.

How it can be used

Use the benchmark page and the HF dataset rows, keeping agent name, model, and task-set version in separate fields.
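
One way to keep those fields apart is a small record type. The following is a minimal sketch in Python, not the actual HF submissions schema: the class `ScoreRow`, its field names, the helper `group_by_identity`, and the sample rows are all hypothetical and exist only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreRow:
    """One leaderboard row. Field names are hypothetical, not the real HF schema."""
    agent: str             # agent scaffold name
    model: str             # underlying model name
    task_set_version: str  # task-set version the run used
    score: float           # reported score (made-up values below)


def group_by_identity(rows):
    """Group scores by (agent, model, task_set_version) so distinct setups are never merged."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row.agent, row.model, row.task_set_version)].append(row.score)
    return dict(groups)


# Made-up rows for illustration only; they are not real results.
rows = [
    ScoreRow("agent-a", "model-x", "2.1", 0.40),
    ScoreRow("agent-b", "model-x", "2.1", 0.55),
    ScoreRow("agent-a", "model-x", "2.0", 0.35),
]
groups = group_by_identity(rows)
```

Because the key is the full triple, the same model run under two agents, or under two task-set versions, ends up in separate groups instead of being averaged together.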

Caveat

Each score cell must identify the agent scaffold and harness, not just the model.
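
A display layer can enforce this caveat by refusing to render a cell that lacks any of the three identifiers. The sketch below assumes a hypothetical helper `format_score_cell` and an arbitrary label layout; neither comes from the benchmark's own tooling.

```python
def format_score_cell(score, model, agent, harness):
    """Render a score cell that names the agent scaffold and harness alongside the model.

    Hypothetical helper: raises ValueError if any identifier is empty or None.
    """
    fields = (("model", model), ("agent", agent), ("harness", harness))
    missing = [name for name, value in fields if not value]
    if missing:
        raise ValueError(f"score cell is missing: {', '.join(missing)}")
    return f"{score:.1f}% ({agent} + {model}, harness: {harness})"


# Placeholder names and a made-up score, for illustration only.
label = format_score_cell(41.3, "model-x", "agent-a", "harness-v1")
```

Failing loudly on a missing scaffold or harness keeps a model-only score from being displayed as though it were comparable to fully labeled ones.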
