Terminal-Bench 2.1
A command-line agent benchmark for completing terminal tasks in reproducible task environments.
Category: Agents | Metric: task success | Higher is better
Official repo includes tasks, Docker setup, adapters, and registry.
1. Expected output
Use the official source links for current output format, submission steps, and benchmark-specific result files.
2. Submit results
Keep source URL, source model name, benchmark version, harness, and run context attached to any reported score.
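One way to keep that metadata attached is to store each score in a record that refuses to exist without it. The sketch below is illustrative only: the class name `ReportedScore`, its field names, and the example values are assumptions made here, not a format defined by the benchmark or its official repo.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ReportedScore:
    """Hypothetical record: a score plus the provenance that must travel with it."""

    score: float
    source_url: str
    source_model_name: str
    benchmark_version: str
    harness: str
    run_context: str

    def __post_init__(self):
        # Reject records where any provenance field is empty or whitespace.
        missing = [
            f.name
            for f in fields(self)
            if f.name != "score" and not str(getattr(self, f.name)).strip()
        ]
        if missing:
            raise ValueError("missing provenance fields: " + ", ".join(missing))


# Placeholder values only; none of these are real results or real links.
example = ReportedScore(
    score=0.5,
    source_url="https://example.com/run",
    source_model_name="example-model",
    benchmark_version="2.1",
    harness="example-harness",
    run_context="placeholder run notes",
)
```

Making the record immutable (`frozen=True`) means a score cannot later be separated from, or edited independently of, the context it was reported with.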
Gotchas
A score cell must show the agent scaffold and harness, not just the model.
Do not mix this benchmark's metric with unrelated benchmark metrics.
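Both gotchas can be checked mechanically before a table is published. The following is a minimal sketch under assumed names (`ScoreCell`, `render_cell`, and `same_metric` are hypothetical helpers, and the values are placeholders): it refuses to render a cell that lacks a scaffold or harness, and it flags a set of cells that mixes metrics.

```python
from typing import Iterable, NamedTuple


class ScoreCell(NamedTuple):
    """Hypothetical table cell: a value plus what produced it."""

    model: str
    scaffold: str
    harness: str
    metric: str
    value: float


def render_cell(cell: ScoreCell) -> str:
    # A model name alone is not enough to interpret an agent score.
    if not cell.scaffold or not cell.harness:
        raise ValueError("score cell needs both agent scaffold and harness")
    return f"{cell.value} {cell.metric} ({cell.model} + {cell.scaffold}, {cell.harness})"


def same_metric(cells: Iterable[ScoreCell]) -> bool:
    # True only when every cell reports the same metric name.
    return len({c.metric for c in cells}) <= 1


# Placeholder cells; not real results.
a = ScoreCell("model-x", "scaffold-y", "harness-z", "task success", 0.5)
b = ScoreCell("model-x", "scaffold-y", "harness-z", "other metric", 0.5)
```

Calling `same_metric([a, b])` returns `False` here, which is the signal to keep the two cells out of the same column.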