evals.report

The official repo includes the tasks, Docker setup, adapters, and registry.

Benchmark: Terminal-Bench 2.1
Dataset: huggingface.co/datasets/harborframework/terminal-bench-2-leaderboard
Metric: task success

1. Expected output

Check the official source links for the current output format, submission steps, and benchmark-specific result files.

2. Submit results

Keep source URL, source model name, benchmark version, harness, and run context attached to any reported score.
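
One way to keep that metadata attached is to store the score and its provenance in a single record that cannot be built with a blank field. The sketch below is illustrative only: the class name `ReportedScore`, its field names, and the example values are assumptions that mirror the list above, not an official submission schema.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ReportedScore:
    """Hypothetical record; fields mirror the list above, not an official schema."""

    score: float
    source_url: str
    source_model_name: str
    benchmark_version: str
    harness: str
    run_context: str

    def __post_init__(self):
        # Reject records where any provenance string is empty or whitespace.
        missing = [
            name
            for name, value in asdict(self).items()
            if isinstance(value, str) and not value.strip()
        ]
        if missing:
            raise ValueError(f"missing provenance fields: {missing}")


# Placeholder values for illustration; none of these are real results.
example = ReportedScore(
    score=0.5,
    source_url="https://example.invalid/leaderboard-entry",
    source_model_name="example-model",
    benchmark_version="Terminal-Bench 2.1",
    harness="example-harness",
    run_context="example run notes",
)
```

Because the record is frozen, the score cannot later be copied or edited apart from the fields that explain where it came from.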

Gotchas

A score cell must show the agent scaffold and harness, not just the model.
Do not mix this benchmark's task-success metric with metrics from unrelated benchmarks.
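
Both gotchas can be enforced mechanically when building a results table. The following is a minimal sketch under stated assumptions: the function names, the cell layout, and the row dictionaries with `metric` and `score` keys are hypothetical and are not taken from the benchmark's own tooling.

```python
EXPECTED_METRIC = "task success"  # the metric named for this benchmark


def format_score_cell(model, scaffold, harness, score):
    """Render a score cell that always names model, agent scaffold, and harness."""
    for label, value in (
        ("model", model),
        ("agent scaffold", scaffold),
        ("harness", harness),
    ):
        if not value or not value.strip():
            raise ValueError(f"score cell is missing the {label}")
    return f"{score:.1%} ({model} | {scaffold} | {harness})"


def mean_task_success(rows):
    """Average scores, refusing rows that report a different metric."""
    if not rows:
        raise ValueError("no rows to aggregate")
    mismatched = [row["metric"] for row in rows if row["metric"] != EXPECTED_METRIC]
    if mismatched:
        raise ValueError(f"unrelated metrics in input: {mismatched}")
    return sum(row["score"] for row in rows) / len(rows)


# Placeholder names and score, not real results.
print(format_score_cell("example-model", "example-scaffold", "example-harness", 0.5))
# prints: 50.0% (example-model | example-scaffold | example-harness)
```

Raising an error instead of silently dropping a row makes a mixed-metric table fail loudly before it is published.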