Terminal-Bench 2.1
A command-line agent benchmark for completing terminal tasks in reproducible task environments.
Agents · task success · Higher is better
| Model | Lab | Score | Source model | Status | Date |
|---|---|---|---|---|---|
| Codex CLI + GPT-5.5 | Agent systems | 83.4% | Codex CLI + GPT-5.5 | Verified | May 1, 2026 |
| Claude Code + Claude Opus 4.8 | Agent systems | 78.9% | Claude Code + Claude Opus 4.8 | Verified | May 29, 2026 |
| Terminus 2 + GPT-5.5 | Agent systems | 78.2% | Terminus 2 + GPT-5.5 | Verified | May 1, 2026 |
| Terminus 2 + Claude Opus 4.8 | Agent systems | 74.6% | Terminus 2 + Claude Opus 4.8 | Verified | May 29, 2026 |
| Terminus 2 + Gemini 3 Pro | Agent systems | 74.4% | Terminus 2 + Gemini 3 Pro | Verified | May 1, 2026 |
| Gemini CLI + Gemini 3.1 Pro | Agent systems | 70.7% | Gemini CLI + Gemini 3.1 Pro | Verified | May 5, 2026 |
| Terminus 2 + Gemini 3.1 Pro | Agent systems | 70.3% | Terminus 2 + Gemini 3.1 Pro | Verified | May 5, 2026 |
| Claude Code + Claude Opus 4.7 | Agent systems | 69.7% | Claude Code + Claude Opus 4.7 | Verified | May 1, 2026 |
| Gemini CLI + Gemini 3 Pro | Agent systems | 66.3% | Gemini CLI + Gemini 3 Pro | Verified | May 2, 2026 |
| Terminus 2 + Claude Opus 4.7 | Agent systems | 66.1% | Terminus 2 + Claude Opus 4.7 | Verified | May 1, 2026 |
| Claude Code + GLM 5.1 | Agent systems | 58.7% | Claude Code + GLM 5.1 | Verified | May 2, 2026 |
Each row reports the task success rate of one agent and model pairing on Terminal-Bench 2.1, sorted from highest to lowest score.
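Because several models appear both with a vendor CLI agent and with Terminus 2, the table supports a simple side-by-side reading. The sketch below computes, for each such model, the score difference in percentage points between the vendor agent and Terminus 2. The scores are copied from the table; the `scores` dictionary and the `harness_deltas` function are illustrative names, not part of Terminal-Bench, and the differences are plain arithmetic on single reported scores that may come from runs on different dates.

```python
# Scores (percent task success) copied from the table, keyed by (agent, model).
scores = {
    ("Codex CLI", "GPT-5.5"): 83.4,
    ("Claude Code", "Claude Opus 4.8"): 78.9,
    ("Terminus 2", "GPT-5.5"): 78.2,
    ("Terminus 2", "Claude Opus 4.8"): 74.6,
    ("Terminus 2", "Gemini 3 Pro"): 74.4,
    ("Gemini CLI", "Gemini 3.1 Pro"): 70.7,
    ("Terminus 2", "Gemini 3.1 Pro"): 70.3,
    ("Claude Code", "Claude Opus 4.7"): 69.7,
    ("Gemini CLI", "Gemini 3 Pro"): 66.3,
    ("Terminus 2", "Claude Opus 4.7"): 66.1,
    ("Claude Code", "GLM 5.1"): 58.7,
}

BASELINE_AGENT = "Terminus 2"


def harness_deltas(scores):
    """Map each model to (agent, delta), where delta is the agent's score
    minus the Terminus 2 score for the same model, in percentage points.
    Models without a Terminus 2 row are skipped."""
    deltas = {}
    for (agent, model), score in scores.items():
        if agent == BASELINE_AGENT:
            continue
        baseline = scores.get((BASELINE_AGENT, model))
        if baseline is None:
            continue  # no Terminus 2 entry to compare against
        deltas[model] = (agent, round(score - baseline, 1))
    return deltas


if __name__ == "__main__":
    for model, (agent, delta) in harness_deltas(scores).items():
        print(f"{model}: {agent} vs {BASELINE_AGENT}: {delta:+.1f} points")
```

On the listed scores, the vendor agent leads Terminus 2 for GPT-5.5, Claude Opus 4.8, Claude Opus 4.7, and Gemini 3.1 Pro, and trails it for Gemini 3 Pro. GLM 5.1 has no Terminus 2 row and is left out of the comparison.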