evals.report

SWE-bench Pro

A harder public software-engineering agent benchmark built around professional repository tasks.

Coding | % resolved | Higher is better

What this benchmark measures

Rows on this page are sourced from public benchmark artifacts, leaderboard exports, or source-linked model reports. Each row keeps benchmark version, source model name, and available run details attached to the score.

The metric shown here is % resolved. It should be interpreted within SWE-bench Pro, not compared as part of a site-wide ranking.
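As a rough sketch of the arithmetic behind a % resolved figure, assuming each task ends with a simple resolved/unresolved outcome: the function name and the sample outcomes below are hypothetical and are not taken from any SWE-bench Pro artifact.

```python
def percent_resolved(outcomes: list[bool]) -> float:
    """Return the share of resolved tasks as a percentage (0 to 100)."""
    if not outcomes:
        # An empty run has no meaningful score, so fail loudly.
        raise ValueError("no task outcomes to score")
    # True counts as 1 and False as 0, so sum() is the number resolved.
    return 100.0 * sum(outcomes) / len(outcomes)


# Hypothetical run: 3 of 8 tasks resolved.
sample_outcomes = [True, False, False, True, False, True, False, False]
print(percent_resolved(sample_outcomes))  # 37.5
```

The result is only meaningful relative to the task set it was computed on, which is why the number stays inside SWE-bench Pro.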

What to be careful about

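Record the maximum turn count and the agent configuration alongside every score, because results are scaffold-dependent: the same model can resolve a different share of tasks under a different agent setup.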
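One way to keep that context attached is to store it on the row itself and check it before comparing two scores. The sketch below is illustrative only: the `ScoreRow` fields, the `same_setup` helper, and all sample values are assumptions, not the site's actual schema or real results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreRow:
    """A score plus the run details needed to interpret it (hypothetical schema)."""
    benchmark_version: str
    source_model_name: str
    agent_config: str
    max_turns: int
    percent_resolved: float


def same_setup(a: ScoreRow, b: ScoreRow) -> bool:
    """True only when both rows share benchmark version, agent config, and max turns."""
    return (a.benchmark_version, a.agent_config, a.max_turns) == (
        b.benchmark_version,
        b.agent_config,
        b.max_turns,
    )


# Placeholder rows; names and numbers are made up for illustration.
row_a = ScoreRow("v1", "model-a", "scaffold-x", 50, 30.0)
row_b = ScoreRow("v1", "model-b", "scaffold-x", 50, 25.0)
row_c = ScoreRow("v1", "model-b", "scaffold-y", 200, 35.0)

print(same_setup(row_a, row_b))  # True: same scaffold and turn budget
print(same_setup(row_b, row_c))  # False: different scaffold and turn budget
```

When `same_setup` returns False, a gap between two scores may reflect the scaffold rather than the model.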

No composite ranking
evals.report never combines benchmarks. % resolved on SWE-bench Pro is its own number, so don’t average it with other metrics.