Terminal-Bench 2.0
An agentic benchmark measuring whether an AI model can complete real command-line / terminal software tasks end-to-end (version 2.0, the 89-task set), scored by task success rate. Distinct from the newer Terminal-Bench 2.1 (a different task set); most 2026 model cards self-report this 2.0 version.
Agents · task success · Higher is better
| Model | Lab | Score | Source model | Status | Date | Details |
|---|---|---|---|---|---|---|
| Claude Mythos Preview | Anthropic | 82.0% | — | Unverified | Apr 7, 2026 | Details |
| GPT-5.3-Codex | OpenAI | 77.3% | — | Verified | Feb 5, 2026 | Details |
| GPT-5.4 | OpenAI | 75.1% | — | Verified | Mar 5, 2026 | Details |
| Qwen3.7 Max Preview | Alibaba / Qwen | 69.7% | — | Unverified | May 14, 2026 | Details |
| MiMo-V2.5-Pro | Xiaomi | 68.4% | — | Verified | Apr 22, 2026 | Details |
| DeepSeek V4 Pro | DeepSeek | 67.9% | — | Verified | Apr 24, 2026 | Details |
| Kimi K2.6 | Moonshot AI | 66.7% | — | Verified | Apr 20, 2026 | Details |
| Claude Opus 4.6 | Anthropic | 65.4% | — | Verified | Feb 5, 2026 | Details |
| GLM-5.1 | Z.ai | 63.5% | — | Verified | Apr 7, 2026 | Details |
| Claude Sonnet 4.6 | Anthropic | 59.1% | — | Verified | Feb 17, 2026 | Details |
| MiniMax M2.7 | MiniMax | 57.0% | — | Verified | Mar 18, 2026 | Details |
| DeepSeek V4 Flash | DeepSeek | 56.9% | — | Verified | Apr 24, 2026 | Details |
| Doubao Seed 2.0 Pro | ByteDance | 55.8% | — | Verified | Feb 14, 2026 | Details |
| Qwen3.5-397B-A17B | Alibaba / Qwen | 52.5% | — | Verified | Feb 16, 2026 | Details |
| MAI-Thinking-1 | Microsoft AI | 46.0% | — | Verified | Jun 2, 2026 | Details |
| GLM-4.7 | Z.ai | 41.0% | — | Verified | Dec 22, 2025 | Details |
| DeepSeek V3.2 | DeepSeek | 39.6% | — | Official | Dec 1, 2025 | Details |
| Kimi K2 Thinking | Moonshot AI | 35.7% | — | Official | Nov 6, 2025 | Details |
Each row reports the model’s task success rate on Terminal-Bench 2.0.
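As a rough illustration of the metric, the sketch below computes a task success rate from per-task pass/fail outcomes. It assumes one boolean outcome per task; the function name `success_rate` and the example counts are hypothetical, and the actual harness or a given lab's run may aggregate differently (for example, by averaging over repeated trials).

```python
def success_rate(outcomes: list[bool]) -> float:
    """Return the percentage of tasks marked as passed."""
    if not outcomes:
        raise ValueError("no task outcomes to score")
    # True counts as 1 and False as 0, so sum() gives the number of passes.
    return 100.0 * sum(outcomes) / len(outcomes)


# Hypothetical run over the 89-task set: 60 passes, 29 failures.
outcomes = [True] * 60 + [False] * 29
print(f"{success_rate(outcomes):.1f}%")  # prints "67.4%"
```

With 89 tasks, each additional solved task moves a single-run score by roughly 1.1 percentage points, which is worth keeping in mind when comparing closely ranked rows.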