Question 1

What is SWE-Marathon?

Accepted Answer

SWE-Marathon is a long-horizon software-engineering benchmark of 20 realistic, multi-hour tasks: library reproductions, full-stack product clones, ML engineering, and algorithmic optimization. It tests whether frontier coding agents can autonomously complete ultra-long-horizon work. It is an agents benchmark, scored by binary pass@1 resolution rate with reward-hacking-resistant verifiers.

Question 2

What does resolution rate (pass@1) mean on SWE-Marathon?

Accepted Answer

Resolution rate (pass@1) is the percentage of tasks an agent resolves in a single attempt, where each task either passes or fails. Higher is better. Scores are shown only within SWE-Marathon and are never averaged with other benchmarks.
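
As a rough illustration of the arithmetic, the sketch below computes a resolution rate from binary per-task outcomes. The function name `resolution_rate` and the outcome list are hypothetical, chosen only for this example; they are not SWE-Marathon's actual scoring code or real results.

```python
def resolution_rate(outcomes: list[bool]) -> float:
    """Return the percentage of tasks resolved on the single attempt (pass@1)."""
    if not outcomes:
        raise ValueError("no task outcomes given")
    # Each outcome is binary: True if the verifier accepted the task, else False.
    return 100.0 * sum(outcomes) / len(outcomes)


# Hypothetical run over 20 tasks with 8 resolved (made-up numbers, not real results).
outcomes = [True] * 8 + [False] * 12
rate = resolution_rate(outcomes)
print(f"{rate:.1f}%")  # prints "40.0%"
```

Because each task is pass or fail, a task gets no partial credit; only fully resolved tasks count toward the rate.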

Question 3

What is the top reported SWE-Marathon score?

Accepted Answer

Kimi K3 has the top reported score on SWE-Marathon: 42.0% resolution rate (pass@1).

Question 4

Why do SWE-Marathon scores differ across runs?

Accepted Answer

Harness, scaffold, reasoning effort, and prompt setup change results, so two runs of the same model can differ. evals.report keeps each score with its run context so the differences stay visible.
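
One way to picture "a score kept with its run context" is a record that stores the setup alongside the number. The sketch below is an assumption made for illustration: the class names, field names, and values are hypothetical and do not describe evals.report's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunContext:
    # The four factors the answer above names as affecting results.
    harness: str
    scaffold: str
    reasoning_effort: str
    prompt_setup: str


@dataclass(frozen=True)
class ScoreRecord:
    model: str
    benchmark: str
    score_pct: float
    context: RunContext


def same_setup(a: ScoreRecord, b: ScoreRecord) -> bool:
    """Two scores are directly comparable only if benchmark and run context match."""
    return a.benchmark == b.benchmark and a.context == b.context


# Hypothetical records: same model, different reasoning effort (made-up values).
run_a = ScoreRecord("model-x", "SWE-Marathon", 30.0,
                    RunContext("harness-1", "scaffold-1", "high", "default"))
run_b = ScoreRecord("model-x", "SWE-Marathon", 35.0,
                    RunContext("harness-1", "scaffold-1", "low", "default"))

print(same_setup(run_a, run_b))  # prints "False": the run contexts differ
```

Keeping the context on the record means a gap between two scores for the same model can be traced to a setup difference instead of being hidden behind a single number.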

Question 5

Does evals.report rank models across benchmarks?

Accepted Answer

No. SWE-Marathon scores are shown within their own metric; evals.report never combines benchmarks into a composite ranking or a single "best model".
