evals.report

SWE-fficiency

Measures whether coding agents can optimize real-world repositories for performance: generate a pull request that speeds up a target workload while keeping the repository's existing tests passing (498 tasks across 9 large Python repos).
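This page does not define how the speedup score is computed, so the sketch below only illustrates the general idea in the description: compare a workload's runtime before and after a patch, and give no credit if the existing tests fail. The function names, the best-of-N timing, and the zero-credit rule are all assumptions made for illustration, not the benchmark's actual harness or scoring formula.

```python
import time
from typing import Callable


def time_workload(workload: Callable[[], object], repeats: int = 5) -> float:
    """Return the best wall-clock time (seconds) over several runs.

    Hypothetical helper: best-of-N is one common way to reduce timing noise.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        best = min(best, time.perf_counter() - start)
    return best


def speedup(baseline_seconds: float, patched_seconds: float, tests_pass: bool) -> float:
    """Ratio of baseline to patched runtime.

    Assumption: a patch that breaks the existing tests earns no credit (0.0).
    """
    if baseline_seconds <= 0 or patched_seconds <= 0:
        raise ValueError("runtimes must be positive")
    if not tests_pass:
        return 0.0
    return baseline_seconds / patched_seconds
```

Under these assumptions, a patch that halves the workload's runtime while keeping tests green gives `speedup(2.0, 1.0, True) == 2.0`, and the same patch with a failing test suite gives `0.0`.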

Category: Coding. Metric: speedup score (higher is better).
Model | Lab | Score | Source model | Status | Date
MiniMax M3 | MiniMax | 34.8% | MiniMax M3 | Verified |

Each row reports the model’s speedup score on SWE-fficiency.