evals.report

SWE-bench Verified

A curated SWE-bench split for evaluating systems that resolve real software engineering issues.

Coding | % resolved | Higher is better

Known official sources: 1

Ready now | Raw JSON | Structured data | Run guide ready | Machine-readable


Canonical software-engineering agent benchmark already in product scope.

Category: Coding
Owner: SWE-bench
Data path: Official leaderboard rows and per-instance metadata can be shown with scaffold and tool context preserved.
Known caveat: Agent scaffold, tools, repository setup, and patch validation details affect comparability.
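
The metric and the caveat above can be illustrated with a minimal Python sketch. It assumes a hypothetical row shape (`LeaderboardRow` and its fields `system`, `scaffold`, `tools`, `resolved`, `total` are illustrative names, not the official SWE-bench leaderboard schema). It computes % resolved and treats two rows as directly comparable only when their scaffold and tool context match. Repository setup and patch validation details would need the same treatment but are left out for brevity.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LeaderboardRow:
    # Hypothetical fields for illustration; not the official schema.
    system: str
    scaffold: str
    tools: Tuple[str, ...]
    resolved: int  # number of instances whose patch passed validation
    total: int     # number of instances attempted

    @property
    def pct_resolved(self) -> float:
        """Share of instances resolved, as a percentage (higher is better)."""
        if self.total <= 0:
            raise ValueError("total must be positive")
        return 100.0 * self.resolved / self.total


def comparable(a: LeaderboardRow, b: LeaderboardRow) -> bool:
    """True only if both rows share scaffold, tool set, and instance count."""
    return (
        a.scaffold == b.scaffold
        and set(a.tools) == set(b.tools)
        and a.total == b.total
    )


# Placeholder values, not real leaderboard results.
row_a = LeaderboardRow("system-a", "scaffold-x", ("bash", "editor"), 250, 500)
row_b = LeaderboardRow("system-b", "scaffold-x", ("editor", "bash"), 200, 500)
row_c = LeaderboardRow("system-c", "scaffold-y", ("bash",), 300, 500)

print(row_a.pct_resolved)         # 50.0
print(comparable(row_a, row_b))   # True: same scaffold and tools
print(comparable(row_a, row_c))   # False: different scaffold and tools
```

Keeping the scaffold and tool fields on each row is what lets a reader decide whether a difference in % resolved reflects the underlying system or the harness around it.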