evals.report

DeepSWE

A long-horizon software-engineering benchmark with original tasks, broad repository coverage, and behavioral verifiers.

Coding · % resolved · Higher is better

What this benchmark measures

Rows on this page are sourced from public benchmark artifacts, leaderboard exports, or source-linked model reports. Each row keeps benchmark version, source model name, and available run details attached to the score.

The metric shown here is % resolved. It should be interpreted within DeepSWE, not compared as part of a site-wide ranking.
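
As a minimal sketch of how a % resolved figure is typically derived, assume each task's verifier yields a pass/fail outcome. The function name and the sample outcomes below are hypothetical and are not taken from DeepSWE artifacts.

```python
def percent_resolved(outcomes: list[bool]) -> float:
    """Return the share of resolved tasks as a percentage.

    `outcomes` holds one pass/fail flag per task (assumed format).
    """
    if not outcomes:
        raise ValueError("no task outcomes provided")
    # True counts as 1 and False as 0 when summed.
    return 100.0 * sum(outcomes) / len(outcomes)


# Hypothetical run: 3 of 4 tasks pass their verifiers.
example_outcomes = [True, True, False, True]
print(percent_resolved(example_outcomes))  # 75.0
```

The result only has meaning relative to the DeepSWE task set it was computed on, which is why it is not folded into a site-wide ranking.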

What to be careful about

All leaderboard scores use mini-swe-agent as the harness. Each row stores harness, reasoning effort, sample count, confidence interval, and cost metadata where available, so check these before comparing two scores.
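
To illustrate why sample count and confidence interval matter alongside the score, here is a hedged sketch of a row record and a Wilson score interval. The page does not say which interval method it uses; Wilson is one common choice for a pass rate. The `RunRow` class and its field names are hypothetical, and the sketch assumes sample count means the number of scored task attempts.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunRow:
    """Hypothetical row record; field names are illustrative only."""
    model: str
    harness: str
    reasoning_effort: str
    resolved: int          # attempts whose verifier passed
    samples: int           # total scored attempts (assumed meaning)
    cost_usd: Optional[float] = None


def wilson_interval(resolved: int, samples: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate; z=1.96 gives roughly 95%."""
    if samples <= 0:
        raise ValueError("samples must be positive")
    if not 0 <= resolved <= samples:
        raise ValueError("resolved must be between 0 and samples")
    p = resolved / samples
    denom = 1.0 + z * z / samples
    center = (p + z * z / (2 * samples)) / denom
    half = z * math.sqrt(p * (1 - p) / samples + z * z / (4 * samples * samples)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical row, not a real leaderboard entry.
row = RunRow(model="example-model", harness="mini-swe-agent",
             reasoning_effort="high", resolved=50, samples=100)
low, high = wilson_interval(row.resolved, row.samples)
print(f"{100 * low:.1f}% to {100 * high:.1f}% resolved")
```

With few samples the interval is wide, so two rows whose intervals overlap heavily should not be read as a clear ordering.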

No composite ranking
evals.report never combines benchmarks. % resolved on DeepSWE is its own number; don’t average it with other metrics.