evals.report

SWE-fficiency

Measures whether coding agents can optimize real-world repositories for performance. For each task, the agent generates a pull request that speeds up a target workload while keeping the repository's existing tests passing. The benchmark comprises 498 tasks across 9 large Python repositories.

Coding | speedup score | Higher is better
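This page does not say how the speedup score is computed. As a rough illustration of the idea only, the sketch below times a workload before and after a change and counts the result only if the tests still pass. The names `best_time`, `speedup`, `baseline`, `patched`, and `tests_pass` are hypothetical, and the ratio used here is an assumption, not the benchmark's actual scoring rule.

```python
import time
from typing import Callable


def best_time(workload: Callable[[], object], repeats: int = 5) -> float:
    """Return the fastest wall-clock time over several runs, in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        timings.append(time.perf_counter() - start)
    # Clamp to a tiny positive value so the ratio below never divides by zero.
    return max(min(timings), 1e-12)


def speedup(
    baseline: Callable[[], object],
    patched: Callable[[], object],
    tests_pass: Callable[[], bool],
    repeats: int = 5,
) -> float:
    """Assumed score: baseline time / patched time, or 0.0 if tests fail."""
    if not tests_pass():
        # A faster workload does not count if it breaks existing tests.
        return 0.0
    return best_time(baseline, repeats) / best_time(patched, repeats)
```

Under this assumed definition, a value above 1.0 means the patched workload ran faster than the baseline, which is consistent with "higher is better".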

No run guide for this benchmark yet.