SWE-bench Pro
A harder public software-engineering agent benchmark built around professional repository tasks.
Coding · % resolved · Higher is better
What this benchmark measures
Rows on this page are sourced from public benchmark artifacts, leaderboard exports, or source-linked model reports. Each row keeps benchmark version, source model name, and available run details attached to the score.
The metric shown here is % resolved. It should be interpreted within SWE-bench Pro, not compared as part of a site-wide ranking.
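As a rough illustration of the metric, % resolved can be read as the share of benchmark tasks that an agent run resolves. The sketch below assumes a hypothetical list of per-task booleans. The function name and input shape are illustrative and are not part of any SWE-bench Pro tooling.

```python
def percent_resolved(outcomes):
    """Return the share of resolved tasks as a percentage.

    `outcomes` is a hypothetical list of booleans, one per task,
    where True means the task counted as resolved.
    """
    if not outcomes:
        raise ValueError("no task outcomes given")
    # True counts as 1 and False as 0, so sum() gives the resolved count.
    return 100.0 * sum(outcomes) / len(outcomes)


# Illustrative data only, not real benchmark results.
example_outcomes = [True, False, True, True]
print(percent_resolved(example_outcomes))
```

The empty-input check is a design choice for the sketch: a run with no tasks has no meaningful score, so it raises an error instead of reporting 0.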
What to be careful about
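Track the max-turn limit and agent configuration for each score, because results are scaffold-dependent: the same model can resolve a different share of tasks under a different agent setup.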
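One way to keep that caveat visible is to store the run setup next to each score and only compare rows that share it. The sketch below is a minimal illustration under assumed names: `ScoreRow`, its fields, and `group_by_setup` are hypothetical and do not reflect how evals.report stores its data.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScoreRow:
    # All field names here are assumptions made for illustration.
    model: str
    benchmark_version: str
    percent_resolved: float
    scaffold: Optional[str] = None   # agent harness name, if reported
    max_turns: Optional[int] = None  # turn limit, if reported


def group_by_setup(rows):
    """Group rows so that only runs with the same benchmark version,
    scaffold, and turn limit end up in the same bucket."""
    groups = defaultdict(list)
    for row in rows:
        key = (row.benchmark_version, row.scaffold, row.max_turns)
        groups[key].append(row)
    return dict(groups)


# Placeholder rows, not real benchmark results.
rows = [
    ScoreRow("model-a", "v1", 30.0, scaffold="harness-x", max_turns=50),
    ScoreRow("model-b", "v1", 25.0, scaffold="harness-x", max_turns=50),
    ScoreRow("model-a", "v1", 35.0, scaffold="harness-y", max_turns=200),
]
for setup, group in group_by_setup(rows).items():
    print(setup, [r.model for r in group])
```

Rows with missing run details share the key value `None`, so they are grouped apart from rows whose setup is known.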
No composite ranking
evals.report never combines benchmarks. % resolved on SWE-bench Pro is its own number, so don’t average it with other metrics.