evals.report

SWE-Marathon

A long-horizon software-engineering benchmark of 20 realistic, multi-hour tasks spanning library reproductions, full-stack product clones, ML engineering, and algorithmic optimization. It tests whether frontier coding agents can autonomously complete ultra-long-horizon work. Scoring is the binary pass@1 resolution rate, checked by verifiers designed to resist reward hacking.
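As a rough illustration of the metric, a binary pass@1 resolution rate is the fraction of tasks whose single attempt passes the verifier. The sketch below assumes each task yields one pass/fail outcome; the function name `resolution_rate`, the task IDs, and the outcomes are hypothetical and are not taken from the benchmark's harness or results.

```python
from typing import Mapping


def resolution_rate(outcomes: Mapping[str, bool]) -> float:
    """Return the fraction of tasks resolved on their single attempt (pass@1).

    `outcomes` maps a task ID to True if the verifier accepted the one
    attempt made on that task, and False otherwise.
    """
    if not outcomes:
        raise ValueError("no task outcomes to score")
    resolved = sum(1 for passed in outcomes.values() if passed)
    return resolved / len(outcomes)


# Hypothetical data: 20 tasks, of which the first 7 are marked resolved.
outcomes = {f"task-{i:02d}": i < 7 for i in range(20)}
rate = resolution_rate(outcomes)
print(f"{rate:.1%}")  # prints 35.0%
```

Because each task is scored as a plain pass or fail, partial progress on a task contributes nothing to the rate under this reading of the metric.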

Agents · Resolution rate (pass@1) · Higher is better

No run guide for this benchmark yet.