evals.report

Terminal-Bench 2.0

An agentic benchmark measuring whether an AI model can complete real command-line (terminal) software tasks end-to-end, scored by task success rate. This page covers version 2.0, the 89-task set. It is distinct from the newer Terminal-Bench 2.1, which uses a different task set; most 2026 model cards self-report results on version 2.0.
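
As a rough illustration of the metric, the sketch below converts a count of completed tasks into a success rate over the 89-task set. The function names and the single-run assumption are hypothetical and not taken from the benchmark's harness; published scores may be averaged over several trials, so they need not correspond to a whole number of tasks.

```python
TOTAL_TASKS = 89  # size of the Terminal-Bench 2.0 task set, per the description above


def success_rate(passed: int, total: int = TOTAL_TASKS) -> float:
    """Return the percentage of tasks completed, rounded to one decimal place.

    Assumes a single run in which each task either passes or fails.
    """
    if not 0 <= passed <= total:
        raise ValueError(f"passed must be between 0 and {total}, got {passed}")
    return round(100 * passed / total, 1)


def points_per_task(total: int = TOTAL_TASKS) -> float:
    """Return how many percentage points a single task is worth."""
    return 100 / total


if __name__ == "__main__":
    # Hypothetical run: 73 of 89 tasks completed.
    print(success_rate(73))
    print(round(points_per_task(), 2))
```

Under these assumptions one task is worth a little over one percentage point, so gaps of a few tenths of a point between neighboring rows amount to less than a single task in a single run.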

Agents · task success · Higher is better
| Model | Lab | Score | Source model | Status | Date |
|---|---|---|---|---|---|
| Claude Mythos Preview | Anthropic | 82.0% | | Unverified | Apr 7, 2026 |
| GPT-5.3-Codex | OpenAI | 77.3% | | Verified | Feb 5, 2026 |
| GPT-5.4 | OpenAI | 75.1% | | Verified | Mar 5, 2026 |
| Qwen3.7 Max Preview | Alibaba / Qwen | 69.7% | | Unverified | May 14, 2026 |
| MiMo-V2.5-Pro | Xiaomi | 68.4% | | Verified | Apr 22, 2026 |
| DeepSeek V4 Pro | DeepSeek | 67.9% | | Verified | Apr 24, 2026 |
| Kimi K2.6 | Moonshot AI | 66.7% | | Verified | Apr 20, 2026 |
| Claude Opus 4.6 | Anthropic | 65.4% | | Verified | Feb 5, 2026 |
| GLM-5.1 | Z.ai | 63.5% | | Verified | Apr 7, 2026 |
| Claude Sonnet 4.6 | Anthropic | 59.1% | | Verified | Feb 17, 2026 |
| MiniMax M2.7 | MiniMax | 57.0% | | Verified | Mar 18, 2026 |
| DeepSeek V4 Flash | DeepSeek | 56.9% | | Verified | Apr 24, 2026 |
| Doubao Seed 2.0 Pro | ByteDance | 55.8% | | Verified | Feb 14, 2026 |
| Qwen3.5-397B-A17B | Alibaba / Qwen | 52.5% | | Verified | Feb 16, 2026 |
| MAI-Thinking-1 | Microsoft AI | 46.0% | | Verified | Jun 2, 2026 |
| GLM-4.7 | Z.ai | 41.0% | | Verified | Dec 22, 2025 |
| DeepSeek V3.2 | DeepSeek | 39.6% | | Official | Dec 1, 2025 |
| Kimi K2 Thinking | Moonshot AI | 35.7% | | Official | Nov 6, 2025 |

Each row reports the model’s task success rate on Terminal-Bench 2.0.