evals.report

GSO: Software Optimization Benchmark for SWE-Agents

GSO evaluates AI coding agents on 102 challenging real-world software performance optimization tasks across 10 codebases in 5 languages, measuring whether an agent's patch matches expert-developer speedups while remaining correct.
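As a rough illustration of how a score like Opt@1 could be tallied, the sketch below counts a task as solved only when the patch passes its tests and its speedup reaches a set fraction of the expert's speedup. The `TaskResult` record layout, the `opt_at_1` helper, and the 0.95 threshold are assumptions made for this example and are not taken from this page; the benchmark's actual scoring may aggregate speedups differently.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of a single agent attempt on one task (hypothetical layout)."""
    task_id: str
    tests_pass: bool       # the patch keeps the code functionally correct
    agent_speedup: float   # speedup of the agent's patch over the original code
    expert_speedup: float  # speedup of the expert developer's patch


def opt_at_1(results: list[TaskResult], threshold: float = 0.95) -> float:
    """Fraction of tasks whose single attempt is correct and fast enough.

    `threshold` is an assumed parameter: the share of the expert speedup
    the agent's patch must reach to count as a match.
    """
    if not results:
        raise ValueError("no results to score")
    solved = sum(
        1
        for r in results
        if r.tests_pass and r.agent_speedup >= threshold * r.expert_speedup
    )
    return solved / len(results)


# Made-up results, for illustration only.
results = [
    TaskResult("task-a", True, 2.0, 2.0),    # correct and matches the expert: counts
    TaskResult("task-b", True, 1.2, 2.0),    # correct but too slow: does not count
    TaskResult("task-c", False, 3.0, 2.0),   # fast but incorrect: does not count
    TaskResult("task-d", True, 1.95, 2.0),   # correct, within the threshold: counts
]
print(f"{opt_at_1(results):.2%}")  # 2 of 4 tasks count
```

With 102 tasks, each solved task is worth about 0.98 percentage points, so a score of 44.12% corresponds to 45 of 102 tasks.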

Category: Coding
Metric: Opt@1 (higher is better)

Model                   Lab              Score   Status    Date
Claude Opus 4.7         Anthropic        44.12%  Official  Apr 16, 2026
Claude Opus 4.6         Anthropic        41.18%  Official  Feb 5, 2026
GPT-5.5                 OpenAI           40.2%   Official  Apr 23, 2026
GPT-5.4                 OpenAI           31.37%  Official  Mar 5, 2026
GPT-5.2                 OpenAI           27.45%  Official  Dec 11, 2025
Claude Opus 4.5         Anthropic        26.47%  Official  Nov 24, 2025
Gemini 3.1 Pro Preview  Google DeepMind  22.55%  Official  Feb 19, 2026
Gemini 3 Pro            Google DeepMind  18.63%  Official  Nov 18, 2025
Claude Sonnet 4.5       Anthropic        14.71%  Official  Sep 29, 2025
GPT-5.1                 OpenAI           13.73%  Official  Nov 12, 2025
Gemini 3 Flash          Google DeepMind  9.8%    Official  Dec 17, 2025
o3                      OpenAI           8.82%   Official  Apr 16, 2025
GPT-5                   OpenAI           6.86%   Official  Aug 7, 2025
Claude Opus 4           Anthropic        6.86%   Official  May 22, 2025
Qwen 3 Coder 480B       Alibaba / Qwen   4.9%    Official  Jul 22, 2025
Kimi K2 Instruct        Moonshot AI      4.9%    Official  Jul 11, 2025
Claude Sonnet 4         Anthropic        4.9%    Official  May 22, 2025
Claude 3.5 Sonnet       Anthropic        4.6%    Official  Jun 20, 2024
Gemini 2.5 Pro          Google DeepMind  3.92%   Official  Mar 25, 2025
Claude 3.7 Sonnet       Anthropic        3.8%    Official  Feb 24, 2025
o4-mini                 OpenAI           3.6%    Official  Apr 16, 2025
GPT-4o                  OpenAI           0.0%    Official  May 13, 2024

Each row reports the model’s Opt@1 on GSO.