SWE-fficiency
Measures whether coding agents can optimize real-world repositories for performance: generate a pull request that speeds up a target workload while keeping the repository's existing tests passing (498 tasks across 9 large Python repos).
What this benchmark measures
Rows on this page are sourced from public benchmark artifacts, leaderboard exports, or source-linked model reports. Each row keeps benchmark version, source model name, and available run details attached to the score.
The metric shown here is the speedup score. Interpret it only within SWE-fficiency; it is not meant to be compared with other benchmarks as part of a site-wide ranking.
What to be careful about
The score reflects the speedup an agent achieves relative to expert optimizations. Results are sensitive to both the scaffold and the hardware, so record the run setup alongside each score.
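As a rough illustration of what "speedup relative to expert optimizations" can mean, the sketch below divides the agent's speedup by the expert's speedup for each task. The function name, the timings, and the use of a harmonic mean to aggregate across tasks are assumptions made for this example; this page does not specify the exact formula, so check the benchmark's own documentation before relying on it.

```python
from statistics import harmonic_mean

def speedup_ratio(t_baseline: float, t_agent: float, t_expert: float) -> float:
    """Agent speedup divided by expert speedup (hypothetical helper).

    All arguments are workload runtimes in seconds on the same hardware:
    before any patch, after the agent's patch, and after the expert's patch.
    """
    agent_speedup = t_baseline / t_agent
    expert_speedup = t_baseline / t_expert
    return agent_speedup / expert_speedup

# Made-up timings for two tasks: (baseline, agent, expert)
timings = [(10.0, 5.0, 4.0), (8.0, 8.0, 2.0)]
ratios = [speedup_ratio(*t) for t in timings]

# Assumed aggregation: a harmonic mean, which penalizes tasks
# where the agent falls far short of the expert.
overall = harmonic_mean(ratios)
print(ratios, overall)
```

A ratio of 1.0 would mean the agent matched the expert on that task; values below 1.0 mean it recovered only part of the expert's gain. Because every input is a wall-clock timing, the same patch can produce different ratios on different machines, which is why the run setup matters.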