SWE-Marathon
A long-horizon software-engineering benchmark of 20 realistic, multi-hour tasks spanning library reproductions, full-stack product clones, ML engineering, and algorithmic optimization. It tests whether frontier coding agents can autonomously complete ultra-long-horizon work. Scoring is the binary pass@1 resolution rate, using verifiers designed to resist reward hacking.
Agents · resolution rate (pass@1) · Higher is better
| Model | Lab | Score | Source model | Status | Date |
|---|---|---|---|---|---|
| Claude Code + Claude Opus 4.8 | Agent systems | 26.0% | Claude Code + Claude Opus 4.8 | Official | — |
| Claude Code + Claude Opus 4.7 | Agent systems | 16.0% | Claude Code + Claude Opus 4.7 | Official | — |
| Codex CLI + GPT-5.5 | Agent systems | 12.0% | Codex CLI + GPT-5.5 | Official | — |
| Terminus 2 + Claude Opus 4.7 | Agent systems | 11.0% | Terminus 2 + Claude Opus 4.7 | Official | — |
| Gemini CLI + Gemini 3.5 Flash | Agent systems | 7.0% | Gemini CLI + Gemini 3.5 Flash | Official | — |
| Terminus 2 + GPT-5.5 | Agent systems | 6.0% | Terminus 2 + GPT-5.5 | Official | — |
| Terminus 2 + Gemini 3.1 Pro | Agent systems | 4.0% | Terminus 2 + Gemini 3.1 Pro | Official | — |
| Terminus 2 + DeepSeek V4 Pro | Agent systems | 4.0% | Terminus 2 + DeepSeek V4 Pro | Official | — |
| Gemini CLI + Gemini 3.1 Pro | Agent systems | 2.0% | Gemini CLI + Gemini 3.1 Pro | Official | — |
| Terminus 2 + GLM 5.1 | Agent systems | 1.0% | Terminus 2 + GLM 5.1 | Official | — |
| Terminus 2 + MiniMax M2.7 | Agent systems | 0.0% | Terminus 2 + MiniMax M2.7 | Official | — |
| Kimi Code CLI + Kimi K2.6 | Agent systems | 0.0% | Kimi Code CLI + Kimi K2.6 | Official | — |
| Terminus 2 + Kimi K2.6 | Agent systems | 0.0% | Terminus 2 + Kimi K2.6 | Official | — |
Each row reports the model’s resolution rate (pass@1) on SWE-Marathon.
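As a rough illustration of the metric, a binary pass@1 resolution rate can be read as the share of task attempts whose verifier passed, expressed as a percentage. The sketch below assumes each attempt yields a single pass/fail verdict. The function name `resolution_rate` and the sample outcomes are hypothetical and are not taken from the benchmark's harness or results.

```python
from typing import Sequence


def resolution_rate(resolved: Sequence[bool]) -> float:
    """Return the percentage of attempts whose verifier passed.

    Assumption: each entry is one attempt's binary verdict
    (True = resolved, False = not resolved).
    """
    if not resolved:
        raise ValueError("need at least one task outcome")
    return 100.0 * sum(resolved) / len(resolved)


# Hypothetical outcomes: 20 tasks, one attempt each, 3 resolved.
outcomes = [True] * 3 + [False] * 17
print(f"{resolution_rate(outcomes):.1f}%")
```

Because the verdict is binary, partial progress on a task contributes nothing to the score; only fully resolved attempts count.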