Terminal-Bench 2.1
An important benchmark for command-line agents, with a task registry and results that depend on the adapter used.
Ready now | HF dataset | Review needed | Run guide ready | Public data
Source detail
Score source
The official leaderboard links to a public Hugging Face (HF) submissions dataset.
Run guide
The official repo includes the tasks, Docker setup, adapters, and the task registry.
How it can be used
Use the leaderboard page and the HF dataset rows, keeping agent name, model, and task-set version as separate fields.
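A minimal sketch of what keeping those fields separate could look like. The field names (agent, model, task_set_version, score) and all values are hypothetical placeholders, not the actual columns or results of the HF submissions dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreRow:
    # Hypothetical field names; the real HF dataset columns may differ.
    agent: str
    model: str
    task_set_version: str
    score: float


def group_key(row: ScoreRow) -> tuple[str, str, str]:
    """Return the identifiers that must all match before two scores are compared."""
    return (row.agent, row.model, row.task_set_version)


# Placeholder rows for illustration only; these are not real results.
rows = [
    ScoreRow("agent-a", "model-x", "2.1", 0.5),
    ScoreRow("agent-b", "model-x", "2.1", 0.4),
    ScoreRow("agent-a", "model-x", "2.0", 0.6),
]

# Same model in every row, yet each row lands in its own comparison group.
distinct_groups = {group_key(r) for r in rows}
```

Storing the three identifiers as separate fields, instead of one combined label, means a change in agent or task-set version is never mistaken for a change in the model.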
Caveat
A score cell must show the agent scaffold and the harness, not just the model.
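One way to enforce this caveat is to build the cell label through a function that refuses to render without a scaffold and a harness. This is a sketch under assumptions: the function name, parameter names, and label layout are hypothetical, and the example values are placeholders, not real results.

```python
def score_cell_label(model: str, agent_scaffold: str, harness: str, score: float) -> str:
    """Build a display label that always names the scaffold and harness.

    Hypothetical helper: raises ValueError if any identifier is blank,
    so a model-only cell cannot be produced by accident.
    """
    required = (
        ("model", model),
        ("agent_scaffold", agent_scaffold),
        ("harness", harness),
    )
    for name, value in required:
        if not value.strip():
            raise ValueError(f"{name} is required for a score cell")
    return f"{score:.1%} ({agent_scaffold} + {model}, harness: {harness})"


# Placeholder values for illustration only.
label = score_cell_label("model-x", "agent-a", "harness-h", 0.5)
# label == "50.0% (agent-a + model-x, harness: harness-h)"
```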