evals.report

DeepSWE

Long-horizon software-engineering benchmark with original tasks, broad repo coverage, and behavioral verifiers.

Ready now, Static HTML, Review needed, Run guide ready, Public data

Source detail

Score source

Official blog/data browser exposes leaderboard rows, rollout outcomes, and trial metadata.

Run guide

Official guide documents Pier/Harbor-compatible execution with mini-swe-agent, subsets, single-task runs, and submission.

How it can be used

Start with the leaderboard rows from the official blog and the task manifest, then add trial-level detail once the raw index is pinned.
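The two-stage approach above can be sketched as follows. This is a minimal illustration, not the benchmark's own tooling: the function name `build_entries` and every field name (`model`, `score`, `task_id`, `index_revision`) are hypothetical, since the page does not specify the schema of the leaderboard rows, task manifest, or raw index.

```python
def build_entries(leaderboard_rows, task_manifest, trials=None, index_revision=None):
    """Build one entry per leaderboard row; attach trials only when the index is pinned.

    All field names here are assumptions made for illustration.
    """
    # Task ids listed in the manifest; trials for unknown tasks are ignored.
    known_tasks = {task["task_id"] for task in task_manifest}

    entries = []
    for row in leaderboard_rows:
        entry = {
            "model": row["model"],
            "score": row["score"],
            "task_count": len(known_tasks),
            "trials": [],
            "index_revision": None,
        }
        # Trial-level detail is added only when both the trials and a pinned
        # revision of the raw index are supplied.
        if trials is not None and index_revision is not None:
            entry["trials"] = [
                t for t in trials
                if t["model"] == row["model"] and t["task_id"] in known_tasks
            ]
            entry["index_revision"] = index_revision
        entries.append(entry)
    return entries
```

Keeping the pinned revision on each entry makes it possible to tell later which snapshot of the raw index the trial-level detail came from.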

Caveat

All leaderboard scores were produced with mini-swe-agent. When recording a score, store the harness, reasoning effort, sample count, confidence interval, and cost metadata alongside it.
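One way to keep that metadata attached to each score is a small record type. The sketch below is an assumption about how a consumer of these scores might store them, not a format defined by the benchmark: the class name `ScoreRecord`, its field names, the `cost_usd` unit, and the `comparable` rule are all hypothetical.

```python
from dataclasses import dataclass, asdict
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ScoreRecord:
    """A leaderboard score plus the metadata the caveat asks to store (hypothetical schema)."""
    benchmark: str
    model: str
    score: float
    harness: str                                        # e.g. "mini-swe-agent"
    reasoning_effort: Optional[str] = None
    sample_count: Optional[int] = None
    confidence_interval: Optional[Tuple[float, float]] = None
    cost_usd: Optional[float] = None                    # unit is an assumption

    def missing_metadata(self) -> List[str]:
        """Return the names of metadata fields that were not recorded."""
        return [name for name, value in asdict(self).items() if value is None]


def comparable(a: ScoreRecord, b: ScoreRecord) -> bool:
    """Treat two scores as comparable only if harness and reasoning effort match.

    This rule is an illustrative choice, not one stated by the source.
    """
    return a.harness == b.harness and a.reasoning_effort == b.reasoning_effort
```

Calling `missing_metadata()` before publishing a row gives a quick check that nothing the caveat lists was dropped during ingestion.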
