evals.report

GSO: Software Optimization Benchmark for SWE-Agents

Coding | Opt@1 | Higher is better

What this benchmark measures

GSO evaluates AI coding agents on 102 challenging real-world software performance optimization tasks across 10 codebases in 5 languages, measuring whether an agent's patch matches expert-developer speedups while remaining correct.

Rows on this page are sourced from public benchmark artifacts, leaderboard exports, or source-linked model reports. Each row keeps benchmark version, source model name, and available run details attached to the score.

The metric shown here is Opt@1. It should be interpreted within GSO: Software Optimization Benchmark for SWE-Agents, not compared as part of a site-wide ranking.
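
As a rough illustration of how a score like Opt@1 could be computed from per-task results, the sketch below counts a task as solved only when the agent's single patch passes the correctness tests and reaches the expert-developer speedup. The `TaskResult` fields, the sample numbers, and the 0.95 match threshold are all assumptions made for this example; this page does not state how GSO defines a "match" or how speedups are aggregated.

```python
from dataclasses import dataclass

# Assumed threshold: the fraction of the expert speedup a patch must reach
# to count as a match. Illustrative only; not taken from this page.
MATCH_THRESHOLD = 0.95


@dataclass
class TaskResult:
    """Hypothetical record for one task and one agent attempt."""
    task_id: str
    passes_tests: bool      # patch keeps the code correct
    agent_speedup: float    # speedup of the agent's patch over the baseline
    expert_speedup: float   # speedup of the expert developer's change


def is_solved(result: TaskResult, threshold: float = MATCH_THRESHOLD) -> bool:
    """A task counts only if the patch is correct and fast enough."""
    fast_enough = result.agent_speedup >= threshold * result.expert_speedup
    return result.passes_tests and fast_enough


def opt_at_1(results: list[TaskResult], threshold: float = MATCH_THRESHOLD) -> float:
    """Fraction of tasks solved with a single attempt per task."""
    if not results:
        return 0.0
    return sum(is_solved(r, threshold) for r in results) / len(results)


# Made-up sample data, for demonstration only.
results = [
    TaskResult("task-a", True, 2.0, 2.0),    # correct, matches the expert
    TaskResult("task-b", True, 1.1, 3.0),    # correct, but far too slow
    TaskResult("task-c", False, 4.0, 2.0),   # fast, but breaks correctness
    TaskResult("task-d", True, 1.9, 1.95),   # correct, within the threshold
]

print(f"Opt@1 = {opt_at_1(results):.2f}")
```

Requiring both conditions at once is the point of the sketch: a fast patch that breaks the tests scores the same as no patch, which is why the result should not be read as a plain speedup average.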

No composite ranking
evals.report never combines benchmarks. Opt@1 on GSO: Software Optimization Benchmark for SWE-Agents is its own number — don’t average it with other metrics.