evals.report

SWE-fficiency

Measures whether coding agents can optimize real-world repositories for performance: generate a pull request that speeds up a target workload while keeping the repository's existing tests passing (498 tasks across 9 large Python repos).

Coding · speedup score · Higher is better

What this benchmark measures

SWE-fficiency tests whether a coding agent can make a real-world repository faster. Each task asks the agent to generate a pull request that speeds up a target workload while keeping the repository's existing tests passing. The benchmark covers 498 tasks across 9 large Python repos.

Rows on this page are sourced from public benchmark artifacts, leaderboard exports, or source-linked model reports. Each row keeps benchmark version, source model name, and available run details attached to the score.

The metric shown here is speedup score. It should be interpreted within SWE-fficiency, not compared as part of a site-wide ranking.

What to be careful about

The score reflects the speedup an agent achieves relative to expert optimizations. Results are sensitive to both the agent scaffold and the hardware, so record the run setup alongside each score.
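As a rough illustration of scoring relative to expert optimizations and of keeping the run setup with the result, here is a minimal Python sketch. It assumes the per-task score is the agent's speedup divided by the expert's speedup on the same workload; this page does not state the exact formula or how tasks are aggregated, so treat that as an assumption. The function names, the `RunSetup` fields, the scaffold name, and the timings are all hypothetical.

```python
import os
import platform
from dataclasses import dataclass, asdict


def speedup(baseline_seconds: float, patched_seconds: float) -> float:
    """How many times faster the workload runs after a patch."""
    if baseline_seconds <= 0 or patched_seconds <= 0:
        raise ValueError("runtimes must be positive")
    return baseline_seconds / patched_seconds


def relative_speedup(baseline_seconds: float,
                     agent_seconds: float,
                     expert_seconds: float) -> float:
    """Agent speedup divided by expert speedup.

    ASSUMPTION: this ratio is only one plausible reading of
    "speedup relative to expert optimizations".
    """
    return (speedup(baseline_seconds, agent_seconds)
            / speedup(baseline_seconds, expert_seconds))


@dataclass
class RunSetup:
    """Hypothetical record of the details that make a score reproducible."""
    scaffold: str
    machine: str
    processor: str
    cpu_count: int
    python_version: str


def capture_run_setup(scaffold: str) -> RunSetup:
    """Collect basic hardware and interpreter details from the current host."""
    return RunSetup(
        scaffold=scaffold,
        machine=platform.machine(),
        processor=platform.processor(),
        cpu_count=os.cpu_count() or 0,
        python_version=platform.python_version(),
    )


# Hypothetical timings: the agent halves the runtime, the expert quarters it.
ratio = relative_speedup(baseline_seconds=10.0,
                         agent_seconds=5.0,
                         expert_seconds=2.5)

# Store the score together with the setup that produced it.
record = {
    "task_ratio": ratio,
    "setup": asdict(capture_run_setup("example-scaffold")),
}
print(record["task_ratio"])  # 0.5
```

Under this reading, a ratio of 1.0 would mean the agent matched the expert's speedup. Because timings measured on different hardware or with a different scaffold are not directly comparable, the setup is stored in the same record as the score.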

No composite ranking
evals.report never combines benchmarks. The speedup score on SWE-fficiency is its own number; don’t average it with other metrics.