DeepSWE
A long-horizon software-engineering benchmark with original tasks, broad repository coverage, and behavioral verifiers.
Coding | % resolved | Higher is better
What this benchmark measures
Rows on this page are sourced from public benchmark artifacts, leaderboard exports, or source-linked model reports. Each row keeps benchmark version, source model name, and available run details attached to the score.
The metric shown here is % resolved. It should be interpreted within DeepSWE, not compared as part of a site-wide ranking.
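As a rough illustration of the metric, the sketch below computes % resolved from per-task outcomes. It assumes a task counts as resolved when its verifier passes; the function name and the sample outcomes are hypothetical and not taken from DeepSWE.

```python
def percent_resolved(outcomes: list[bool]) -> float:
    """Return the share of resolved tasks as a percentage.

    Assumption: each entry is True when the task's verifier passed.
    """
    if not outcomes:
        raise ValueError("no task outcomes to score")
    return 100.0 * sum(outcomes) / len(outcomes)


# Hypothetical outcomes for four tasks: three resolved, one not.
example_outcomes = [True, True, False, True]
example_score = percent_resolved(example_outcomes)
```

The result is a percentage of tasks within this benchmark only, which is why it is not meaningful to place it on a shared scale with metrics from other benchmarks.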
What to be careful about
All leaderboard scores use mini-swe-agent. When recording a score, store the harness, reasoning effort, sample count, confidence interval, and cost metadata alongside it.
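One way to keep that metadata attached to a score is a small record type. This is only a sketch: the class name, field names, and the comparability check are assumptions for illustration, not the site's actual schema, and every value in the example is a placeholder.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical record for one leaderboard row."""
    benchmark: str
    benchmark_version: str
    source_model_name: str
    percent_resolved: float
    harness: str
    reasoning_effort: Optional[str] = None
    sample_count: Optional[int] = None
    ci_low: Optional[float] = None   # lower bound of the confidence interval
    ci_high: Optional[float] = None  # upper bound of the confidence interval
    cost_usd: Optional[float] = None


def same_setup(a: RunRecord, b: RunRecord) -> bool:
    """True when two rows share benchmark, version, and harness."""
    return (a.benchmark, a.benchmark_version, a.harness) == (
        b.benchmark, b.benchmark_version, b.harness)


# Placeholder values only; these are not real results.
row_a = RunRecord("DeepSWE", "placeholder-version", "model-a", 0.0,
                  harness="mini-swe-agent", sample_count=1)
row_b = RunRecord("DeepSWE", "placeholder-version", "model-b", 0.0,
                  harness="mini-swe-agent", reasoning_effort="high")
row_as_dict = asdict(row_a)  # plain dict, ready to serialize
```

Optional fields default to None so that a row with missing run details can still be stored without inventing values for them.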
No composite ranking
evals.report never combines benchmarks. % resolved on DeepSWE is its own number; don’t average it with other metrics.