Run Terminal-Bench 2.1
The same run guide is also available from the benchmark detail page.
Category: Agents. Metric: task success.
The official repo includes the tasks, Docker setup, adapters, and registry.
1. Expected output
Use the official source links for current output format, submission steps, and benchmark-specific result files.
2. Submit results
Keep source URL, source model name, benchmark version, harness, and run context attached to any reported score.
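One way to keep that metadata attached is to store each score as a single record. The sketch below is only an illustration, not an official submission format: the class name `ScoreRecord`, its field names, and the helper `to_json` are all hypothetical.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ScoreRecord:
    """Hypothetical record; field names are illustrative, not an official schema."""
    score: float
    source_url: str
    source_model_name: str
    benchmark_version: str
    harness: str
    run_context: str

    def __post_init__(self):
        # Reject records where any text field (the provenance metadata) is blank.
        for name, value in asdict(self).items():
            if isinstance(value, str) and not value.strip():
                raise ValueError(f"missing required field: {name}")


def to_json(record: ScoreRecord) -> str:
    """Serialize the whole record so the score never travels without its metadata."""
    return json.dumps(asdict(record), sort_keys=True)
```

Making the record frozen and validating it on construction means a score with a blank source or harness fails early instead of being reported without context.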
Gotchas
A score cell must show the agent scaffold and harness, not just the model.
Do not mix this benchmark's metric with unrelated benchmark metrics.
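Both gotchas can be checked mechanically before a score is published. This is a minimal sketch, assuming each score cell is a plain dictionary. The key names (`model`, `agent_scaffold`, `harness`, `metric`) and the function `check_score_cell` are hypothetical, and the expected metric string is assumed from the "task success" label on this page.

```python
# Hypothetical key names for a score cell; adjust to whatever your table uses.
REQUIRED_KEYS = ("model", "agent_scaffold", "harness", "metric")
# Assumed from the "task success" label on this page.
EXPECTED_METRIC = "task success"


def check_score_cell(cell: dict) -> list[str]:
    """Return a list of problems; an empty list means the cell passes both checks."""
    problems = []
    for key in REQUIRED_KEYS:
        value = cell.get(key)
        # Treat absent, None, and whitespace-only values as missing.
        if value is None or not str(value).strip():
            problems.append(f"missing {key}")
    metric = cell.get("metric")
    # Flag a metric that belongs to some other benchmark.
    if metric and metric != EXPECTED_METRIC:
        problems.append(f"unexpected metric: {metric}")
    return problems
```

A cell that names only the model would come back with `missing agent_scaffold` and `missing harness`, which is the case the first gotcha warns about.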