evals.report

SWE-Marathon

A long-horizon software-engineering benchmark of 20 realistic, multi-hour tasks (library reproductions, full-stack product clones, ML-engineering, and algorithmic optimization) that test whether frontier coding agents can autonomously complete ultra-long-horizon work; scored by binary pass@1 resolution rate with reward-hacking-resistant verifiers.

Agents · Resolution rate (pass@1) · Higher is better

What this benchmark measures

SWE-Marathon is a long-horizon software-engineering benchmark of 20 realistic, multi-hour tasks spanning library reproductions, full-stack product clones, ML engineering, and algorithmic optimization. It tests whether frontier coding agents can autonomously complete ultra-long-horizon work. Each task is scored as a binary pass or fail on a single attempt (pass@1) by reward-hacking-resistant verifiers, and the reported number is the resulting resolution rate.

Rows on this page are sourced from public benchmark artifacts, leaderboard exports, or source-linked model reports. Each row keeps benchmark version, source model name, and available run details attached to the score.

The metric shown here is resolution rate (pass@1). It should be interpreted within SWE-Marathon, not compared as part of a site-wide ranking.
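As a rough illustration of how a binary pass@1 resolution rate is computed, the sketch below takes one pass/fail outcome per task and returns the fraction resolved. The function name `resolution_rate`, the task ids, and the outcomes are all hypothetical placeholders, not SWE-Marathon's actual harness or results.

```python
from typing import Mapping


def resolution_rate(outcomes: Mapping[str, bool]) -> float:
    """Fraction of tasks resolved on their single recorded attempt (pass@1)."""
    if not outcomes:
        raise ValueError("no task outcomes supplied")
    # Each outcome is binary, so True counts as 1 and False as 0.
    return sum(outcomes.values()) / len(outcomes)


# Hypothetical data: 20 placeholder task ids, the first 5 marked as resolved.
outcomes = {f"task-{i:02d}": i < 5 for i in range(20)}
rate = resolution_rate(outcomes)
print(f"{rate:.0%}")  # 25%
```

With 20 tasks, each resolved task moves the score by 5 percentage points, so small differences between rows correspond to only one or two tasks.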

No composite ranking
evals.report never combines benchmarks. The resolution rate (pass@1) on SWE-Marathon is its own number; don’t average it with other metrics.