SWE-bench Verified
Canonical software-engineering agent benchmark already in product scope.
Ready now | Raw JSON | Structured data | Run guide ready | Machine-readable
Source detail
Score source
The official SWE-bench site repo exposes the leaderboard as JSON, plus per-instance metadata for model runs.
Run guide
The official SWE-bench repo provides harness docs, dataset references, and the evaluation flow.
How it can be used
Official leaderboard rows and per-instance metadata can be shown with scaffold and tool context preserved.
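A minimal sketch of how a leaderboard row might be parsed and displayed with its scaffold and tool context kept attached. The JSON shape, the field names (`name`, `model`, `scaffold`, `tools`, `resolved`), and every value in the sample payload are assumptions made up for illustration; the real leaderboard JSON may be structured differently and should be checked before use.

```python
from __future__ import annotations

import json
from dataclasses import dataclass

# Hypothetical payload: field names and values are placeholders,
# not the actual SWE-bench leaderboard schema or real scores.
SAMPLE_JSON = """
[
  {"name": "example-agent-a", "model": "example-model-x",
   "scaffold": "example-scaffold", "tools": ["bash", "editor"], "resolved": 61.2},
  {"name": "example-agent-b", "model": "example-model-x", "resolved": 48.0}
]
"""


@dataclass(frozen=True)
class LeaderboardRow:
    system_name: str
    model: str
    scaffold: str            # agent scaffold the run used, or "unknown"
    tools: tuple[str, ...]   # tools exposed to the agent, possibly empty
    resolved_pct: float


def parse_rows(payload: str) -> list[LeaderboardRow]:
    """Parse rows, keeping scaffold and tools instead of dropping them."""
    rows = []
    for entry in json.loads(payload):
        rows.append(
            LeaderboardRow(
                system_name=entry["name"],
                model=entry.get("model", "unknown"),
                scaffold=entry.get("scaffold", "unknown"),
                tools=tuple(entry.get("tools", [])),
                resolved_pct=float(entry["resolved"]),
            )
        )
    return rows


def format_row(row: LeaderboardRow) -> str:
    """Render one display line with the run context shown next to the score."""
    tools = ", ".join(row.tools) if row.tools else "unspecified"
    return (
        f"{row.system_name} | model={row.model} | scaffold={row.scaffold} "
        f"| tools={tools} | resolved={row.resolved_pct:.1f}%"
    )


if __name__ == "__main__":
    for row in parse_rows(SAMPLE_JSON):
        print(format_row(row))
```

Missing context is rendered as "unknown" or "unspecified" rather than being silently omitted, so a row without scaffold information is visibly different from one that has it.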
Caveat
Results reflect the whole agent system (model plus scaffold and tools), not pure base-model capability. Store scaffold and tools as run context with every result.
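One way to act on this caveat is to make the run context part of each stored result and to compare scores only between runs that share it. The sketch below assumes hypothetical type names (`RunContext`, `RunResult`) and placeholder counts; it is an illustration of the idea, not a schema taken from SWE-bench.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RunContext:
    """Hypothetical record of the agent system around the model."""
    scaffold: str
    tools: frozenset[str]


@dataclass(frozen=True)
class RunResult:
    model: str
    context: RunContext
    resolved: int   # instances resolved (placeholder values below)
    total: int      # instances attempted


def comparable(a: RunResult, b: RunResult) -> bool:
    """Two results isolate the model only if scaffold and tools match."""
    return a.context == b.context


def group_by_context(results: list[RunResult]) -> dict[RunContext, list[RunResult]]:
    """Bucket results so each group holds like-for-like agent systems."""
    groups: dict[RunContext, list[RunResult]] = defaultdict(list)
    for result in results:
        groups[result.context].append(result)
    return dict(groups)


if __name__ == "__main__":
    ctx_a = RunContext("example-scaffold", frozenset({"bash", "editor"}))
    ctx_b = RunContext("other-scaffold", frozenset({"bash"}))
    results = [
        RunResult("example-model-x", ctx_a, resolved=6, total=10),
        RunResult("example-model-y", ctx_a, resolved=5, total=10),
        RunResult("example-model-x", ctx_b, resolved=4, total=10),
    ]
    for context, runs in group_by_context(results).items():
        print(context.scaffold, sorted(context.tools), [r.model for r in runs])
```

Using a frozen dataclass with a `frozenset` of tools makes the context hashable, so it can serve directly as a grouping key and tool order does not affect equality.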