SWE-bench Verified
A curated SWE-bench split for evaluating systems that resolve real software engineering issues.
Coding · % resolved · Higher is better
What this benchmark measures
SWE-bench Verified focuses on software engineering tasks derived from real repositories. A run typically asks a system to inspect an issue, modify code, and produce a patch that passes the benchmark harness.
Rows on this page are source-backed public benchmark reports. Each score keeps the source model name, benchmark version, and available run context attached to the row.
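As a rough illustration of what "attached to the row" could look like, here is a minimal sketch of a row record. The class name, field names, and all values are hypothetical placeholders chosen for this example; they are not the site's actual schema or real results.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BenchmarkRow:
    """One reported score plus the context it was reported with (hypothetical schema)."""
    source_model_name: str   # model name exactly as the source states it
    benchmark_version: str   # which benchmark split or version the score refers to
    percent_resolved: float  # benchmark-local metric, 0 to 100
    source: str              # where the score was publicly reported
    run_context: dict[str, str] = field(default_factory=dict)  # scaffold, tools, etc., when available


# Placeholder values for illustration only; not a real report.
row = BenchmarkRow(
    source_model_name="example-model",
    benchmark_version="SWE-bench Verified",
    percent_resolved=50.0,
    source="example public report",
    run_context={"scaffold": "example-agent"},
)
print(row.source_model_name, row.percent_resolved)
```

Keeping the context on the same record as the score means a number cannot be quoted without the details needed to interpret it.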
The metric shown here is benchmark-local percent resolved. It should not be averaged with reasoning, preference, or multimodal benchmark scores.
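Percent resolved is the share of task instances whose patch passes the harness, expressed as a percentage. A minimal sketch of that arithmetic follows, assuming each instance has a single pass/fail outcome; the function name and instance IDs are hypothetical.

```python
def percent_resolved(outcomes: dict[str, bool]) -> float:
    """Return the percentage of task instances marked as resolved."""
    if not outcomes:
        raise ValueError("no task instances to score")
    resolved = sum(1 for ok in outcomes.values() if ok)
    return 100.0 * resolved / len(outcomes)


# Hypothetical instance IDs and outcomes, for illustration only.
example = {"task-a": True, "task-b": False, "task-c": True, "task-d": True}
score = percent_resolved(example)
print(f"{score:.1f}% resolved")  # prints "75.0% resolved"
```

Because the denominator is the number of instances in this benchmark, the result only has meaning relative to this benchmark's task set.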
What to be careful about
Agent scaffold, tools, repository setup, and patch validation details affect comparability.
No composite ranking
evals.report never combines benchmarks. The % resolved score on SWE-bench Verified stands on its own; don’t average it with other metrics.