evals.report

SWE-bench Verified

Canonical software-engineering agent benchmark already in product scope.

Ready now | Raw JSON | Structured data | Run guide ready | Machine-readable

Source detail

Score source

The official site's repository exposes leaderboard JSON plus per-instance metadata for model runs.

Run guide

The official SWE-bench repo has harness docs, dataset references, and a description of the evaluation flow.

How it can be used

Official leaderboard rows and per-instance metadata can be shown with scaffold and tool context preserved.
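As a rough illustration of keeping scaffold and tool context attached to each row, here is a minimal Python sketch. The field names (`model`, `scaffold`, `tools`, `resolved_pct`) and the sample row are hypothetical placeholders, not the official leaderboard schema or real results; adapt them to whatever the leaderboard JSON actually contains.

```python
from __future__ import annotations

import json
from dataclasses import dataclass


@dataclass(frozen=True)
class RunContext:
    """Agent-system context that travels with every score."""
    scaffold: str
    tools: tuple[str, ...] = ()


@dataclass(frozen=True)
class LeaderboardRow:
    model: str
    resolved_pct: float
    context: RunContext


def parse_row(raw: dict) -> LeaderboardRow:
    """Build a row from a raw dict; key names are assumed, not official."""
    return LeaderboardRow(
        model=raw["model"],
        resolved_pct=float(raw["resolved_pct"]),
        context=RunContext(
            scaffold=raw.get("scaffold", "unknown"),
            tools=tuple(raw.get("tools", [])),
        ),
    )


def display_label(row: LeaderboardRow) -> str:
    """Render a row so the scaffold and tools are always visible."""
    tools = ", ".join(row.context.tools) or "no tools listed"
    return (
        f"{row.model} via {row.context.scaffold} ({tools}): "
        f"{row.resolved_pct:.1f}% resolved"
    )


# Placeholder data for illustration only; not a real leaderboard entry.
SAMPLE_JSON = (
    '[{"model": "example-model", "scaffold": "example-agent", '
    '"tools": ["bash", "editor"], "resolved_pct": 12.5}]'
)

rows = [parse_row(raw) for raw in json.loads(SAMPLE_JSON)]
labels = [display_label(row) for row in rows]
```

Keeping `RunContext` as a separate frozen object means a score cannot be rendered without its scaffold and tools being available alongside it.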

Caveat

Results are agent-system results, not pure base-model capability. Store scaffold and tools as run context.
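One way to act on this caveat is to compare runs only within the same scaffold and tool set, so a score is never read as a property of the model alone. The sketch below assumes each run is a plain dict with hypothetical keys `model`, `scaffold`, `tools`, and `resolved_pct`; the values are invented placeholders, not real results.

```python
from collections import defaultdict


def group_by_run_context(runs):
    """Group runs by (scaffold, sorted tools) so comparisons stay like-for-like."""
    groups = defaultdict(list)
    for run in runs:
        key = (run["scaffold"], tuple(sorted(run["tools"])))
        groups[key].append(run)
    return dict(groups)


def best_per_context(runs):
    """Return the top-scoring run within each agent-system context."""
    return {
        key: max(group, key=lambda r: r["resolved_pct"])
        for key, group in group_by_run_context(runs).items()
    }


# Invented placeholder runs for illustration only.
runs = [
    {"model": "model-a", "scaffold": "agent-x", "tools": ["bash", "editor"], "resolved_pct": 10.0},
    {"model": "model-b", "scaffold": "agent-x", "tools": ["editor", "bash"], "resolved_pct": 15.0},
    {"model": "model-a", "scaffold": "agent-y", "tools": ["bash"], "resolved_pct": 20.0},
]

best = best_per_context(runs)
```

Here `model-a` appears under two different contexts with different scores, which is exactly the case where reporting a single per-model number would be misleading.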
