evals.report

SWE-Marathon

A long-horizon software-engineering benchmark of 20 realistic, multi-hour tasks (library reproductions, full-stack product clones, ML engineering, and algorithmic optimization) that test whether frontier coding agents can autonomously complete ultra-long-horizon work. Scoring is a binary pass@1 resolution rate, judged by verifiers designed to resist reward hacking.

Agents: resolution rate (pass@1). Higher is better.

| Model | Lab | Score | Source model | Status | Date |
|---|---|---|---|---|---|
| Claude Code + Claude Opus 4.8 | Agent systems | 26.0% | Claude Code + Claude Opus 4.8 | Official | |
| Claude Code + Claude Opus 4.7 | Agent systems | 16.0% | Claude Code + Claude Opus 4.7 | Official | |
| Codex CLI + GPT-5.5 | Agent systems | 12.0% | Codex CLI + GPT-5.5 | Official | |
| Terminus 2 + Claude Opus 4.7 | Agent systems | 11.0% | Terminus 2 + Claude Opus 4.7 | Official | |
| Gemini CLI + Gemini 3.5 Flash | Agent systems | 7.0% | Gemini CLI + Gemini 3.5 Flash | Official | |
| Terminus 2 + GPT-5.5 | Agent systems | 6.0% | Terminus 2 + GPT-5.5 | Official | |
| Terminus 2 + Gemini 3.1 Pro | Agent systems | 4.0% | Terminus 2 + Gemini 3.1 Pro | Official | |
| Terminus 2 + DeepSeek V4 Pro | Agent systems | 4.0% | Terminus 2 + DeepSeek V4 Pro | Official | |
| Gemini CLI + Gemini 3.1 Pro | Agent systems | 2.0% | Gemini CLI + Gemini 3.1 Pro | Official | |
| Terminus 2 + GLM 5.1 | Agent systems | 1.0% | Terminus 2 + GLM 5.1 | Official | |
| Terminus 2 + MiniMax M2.7 | Agent systems | 0.0% | Terminus 2 + MiniMax M2.7 | Official | |
| Kimi Code CLI + Kimi K2.6 | Agent systems | 0.0% | Kimi Code CLI + Kimi K2.6 | Official | |
| Terminus 2 + Kimi K2.6 | Agent systems | 0.0% | Terminus 2 + Kimi K2.6 | Official | |

Each row reports the resolution rate (pass@1) of one agent and model pairing on SWE-Marathon.
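As a rough illustration of the metric, a binary resolution rate is the share of single-attempt runs that the verifier marks as resolved. The sketch below assumes a flat list of pass/fail outcomes; the function name `resolution_rate` and the sample data are hypothetical, and this page does not state how runs are aggregated across tasks or repeated trials.

```python
from typing import Iterable


def resolution_rate(outcomes: Iterable[bool]) -> float:
    """Return the percentage of runs whose verifier result was a pass.

    Assumption: each outcome is one single-attempt (pass@1) run, scored
    as a binary resolved / not resolved result with no partial credit.
    """
    results = list(outcomes)
    if not results:
        raise ValueError("at least one run outcome is required")
    return 100.0 * sum(results) / len(results)


# Hypothetical outcomes for illustration only, not data from the leaderboard.
example_runs = [True, False, False, True, False]
print(f"{resolution_rate(example_runs):.1f}%")  # prints "40.0%"
```

Because each run counts as either resolved or not, a submission that completes most of a task still scores zero for that run.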