evals.report

Terminal-Bench 2.1

A command-line agent benchmark for completing terminal tasks in reproducible task environments.

Agents: task success (higher is better)

| Model | Lab | Score | Source model | Status | Date |
|---|---|---|---|---|---|
| Codex CLI + GPT-5.5 | Agent systems | 83.4% | Codex CLI + GPT-5.5 | Verified | May 1, 2026 |
| Claude Code + Claude Opus 4.8 | Agent systems | 78.9% | Claude Code + Claude Opus 4.8 | Verified | May 29, 2026 |
| Terminus 2 + GPT-5.5 | Agent systems | 78.2% | Terminus 2 + GPT-5.5 | Verified | May 1, 2026 |
| Terminus 2 + Claude Opus 4.8 | Agent systems | 74.6% | Terminus 2 + Claude Opus 4.8 | Verified | May 29, 2026 |
| Terminus 2 + Gemini 3 Pro | Agent systems | 74.4% | Terminus 2 + Gemini 3 Pro | Verified | May 1, 2026 |
| Gemini CLI + Gemini 3.1 Pro | Agent systems | 70.7% | Gemini CLI + Gemini 3.1 Pro | Verified | May 5, 2026 |
| Terminus 2 + Gemini 3.1 Pro | Agent systems | 70.3% | Terminus 2 + Gemini 3.1 Pro | Verified | May 5, 2026 |
| Claude Code + Claude Opus 4.7 | Agent systems | 69.7% | Claude Code + Claude Opus 4.7 | Verified | May 1, 2026 |
| Gemini CLI + Gemini 3 Pro | Agent systems | 66.3% | Gemini CLI + Gemini 3 Pro | Verified | May 2, 2026 |
| Terminus 2 + Claude Opus 4.7 | Agent systems | 66.1% | Terminus 2 + Claude Opus 4.7 | Verified | May 1, 2026 |
| Claude Code + GLM 5.1 | Agent systems | 58.7% | Claude Code + GLM 5.1 | Verified | May 2, 2026 |

Each row reports the task success rate of one agent and model pairing on Terminal-Bench 2.1. Click a row for the full run context.
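
Because several models appear under more than one agent, the rows can be compared per model. The sketch below is illustrative only: it assumes each entry of the form "X + Y" names an agent harness X and a model Y, copies the scores from the rows above, and uses a hypothetical helper, `harness_gaps`, to subtract the Terminus 2 score from each other harness's score for the same model. The differences are plain arithmetic on the published numbers; runs carry different dates, so they may not be strictly comparable.

```python
from collections import defaultdict

# (agent harness, model, task success in percent), copied from the rows above.
# Splitting "X + Y" into harness and model is an assumption about the naming.
ROWS = [
    ("Codex CLI", "GPT-5.5", 83.4),
    ("Claude Code", "Claude Opus 4.8", 78.9),
    ("Terminus 2", "GPT-5.5", 78.2),
    ("Terminus 2", "Claude Opus 4.8", 74.6),
    ("Terminus 2", "Gemini 3 Pro", 74.4),
    ("Gemini CLI", "Gemini 3.1 Pro", 70.7),
    ("Terminus 2", "Gemini 3.1 Pro", 70.3),
    ("Claude Code", "Claude Opus 4.7", 69.7),
    ("Gemini CLI", "Gemini 3 Pro", 66.3),
    ("Terminus 2", "Claude Opus 4.7", 66.1),
    ("Claude Code", "GLM 5.1", 58.7),
]


def harness_gaps(rows, baseline="Terminus 2"):
    """Return {(harness, model): score minus baseline score}, in percentage points.

    Models with no baseline row are skipped.
    """
    by_model = defaultdict(dict)
    for harness, model, score in rows:
        by_model[model][harness] = score

    gaps = {}
    for model, scores in by_model.items():
        if baseline not in scores:
            continue  # nothing to compare against for this model
        for harness, score in scores.items():
            if harness != baseline:
                gaps[(harness, model)] = round(score - scores[baseline], 1)
    return gaps


if __name__ == "__main__":
    for (harness, model), gap in sorted(harness_gaps(ROWS).items()):
        print(f"{harness} vs Terminus 2 on {model}: {gap:+.1f} points")
```

GLM 5.1 is omitted from the output because the table has no Terminus 2 row for it.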