Terminal-Bench 2.1

Name: Terminal-Bench 2.1
Creator: evals.report

A command-line agent benchmark for completing terminal tasks in reproducible task environments.

Agents | task success | Higher is better

Known official sources: 1

Ready now | HF dataset | Review needed | Run guide ready | Public data

An important command-line agent benchmark with a task registry. Results are sensitive to the adapter used.

Category: Agents
Owner: Harbor / Laude Institute
Data path: Use the benchmark page and the Hugging Face (HF) dataset rows, keeping agent name, model, and task-set version as separate fields.
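
One way to keep those three fields separate when collecting rows is to group on all of them at once, so results from different agents or task-set versions never share a bucket. This is a minimal sketch only: the field names (agent, model, task_set_version, task_success) and the sample rows are hypothetical placeholders, not the actual page or HF dataset schema.

```python
from collections import defaultdict
from typing import NamedTuple


class RunKey(NamedTuple):
    """Identity of a result: all three fields stay separate (hypothetical names)."""
    agent: str
    model: str
    task_set_version: str


def group_scores(rows):
    """Group task-success values by (agent, model, task-set version)."""
    grouped = defaultdict(list)
    for row in rows:
        key = RunKey(row["agent"], row["model"], row["task_set_version"])
        grouped[key].append(row["task_success"])
    return dict(grouped)


# Placeholder rows for illustration only; these are not real benchmark results.
rows = [
    {"agent": "agent-a", "model": "model-x", "task_set_version": "2.1", "task_success": 0.5},
    {"agent": "agent-b", "model": "model-x", "task_set_version": "2.1", "task_success": 0.25},
    {"agent": "agent-a", "model": "model-x", "task_set_version": "2.0", "task_success": 0.75},
]
grouped = group_scores(rows)
```

Because the key is a tuple of all three fields, the same model run under two agents, or on two task-set versions, produces distinct entries rather than being merged.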

Known caveat

A score cell must show the agent scaffold and harness, not just the model.
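
The caveat can be enforced when a cell is rendered. The sketch below assumes a hypothetical ScoreCell record and a task-success value stored as a fraction between 0 and 1. It refuses to produce a label unless the model, agent scaffold, and harness are all present. The names and values are illustrative, not part of any official tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreCell:
    """Hypothetical record for one leaderboard cell."""
    model: str
    agent_scaffold: str
    harness: str
    task_success: float  # assumed to be a fraction in [0, 1]

    def label(self) -> str:
        """Render the cell, rejecting any cell that names only part of the setup."""
        for field_name in ("model", "agent_scaffold", "harness"):
            if not getattr(self, field_name).strip():
                raise ValueError(f"score cell is missing {field_name}")
        return (
            f"{self.task_success:.1%} "
            f"({self.agent_scaffold} + {self.model}, harness: {self.harness})"
        )


# Placeholder values for illustration only; not a real result.
cell = ScoreCell(
    model="model-x",
    agent_scaffold="agent-a",
    harness="harness-h",
    task_success=0.5,
)
text = cell.label()
```

Raising an error instead of silently printing a model-only cell keeps incomplete rows out of any comparison table built from these labels.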