DeepSWE
Long-horizon software-engineering benchmark with original tasks, broad repo coverage, and behavioral verifiers.
Ready now, Static HTML, Review needed, Run guide ready, Public data
Source detail
Score source
Official blog/data browser exposes leaderboard rows, rollout outcomes, and trial metadata.
Run guide
Official guide documents Pier/Harbor-compatible execution with mini-swe-agent, subsets, single-task runs, and submission.
How it can be used
Start with the official blog rows and the task manifest, then add trial-level detail once the raw index is pinned.
Caveat
All leaderboard scores use mini-swe-agent, so record the harness, reasoning effort, sample count, confidence interval, and cost metadata alongside each score.
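As a rough illustration of the caveat, the sketch below shows one way a stored score row could carry that metadata. Every name here (`ScoreRecord`, the field names, the 0 to 1 score scale, the US dollar cost unit) is a hypothetical choice for illustration and is not taken from the DeepSWE data or run guide; only the `mini-swe-agent` harness name comes from the source.

```python
from dataclasses import dataclass, asdict
from typing import List, Optional


@dataclass(frozen=True)
class ScoreRecord:
    """One leaderboard row plus the metadata the caveat asks to keep.

    Field names are hypothetical, not the official DeepSWE schema.
    """
    model: str
    score: float                           # assumed scale: fraction resolved, 0.0 to 1.0
    harness: str = "mini-swe-agent"        # harness named in the source
    reasoning_effort: Optional[str] = None
    sample_count: Optional[int] = None
    ci_low: Optional[float] = None         # lower bound of the confidence interval
    ci_high: Optional[float] = None        # upper bound of the confidence interval
    cost_usd: Optional[float] = None       # assumed unit: US dollars

    def missing_metadata(self) -> List[str]:
        """Return the names of metadata fields that are still unset."""
        values = asdict(self)
        optional = ("reasoning_effort", "sample_count", "ci_low", "ci_high", "cost_usd")
        return [name for name in optional if values[name] is None]


# Usage: a row ingested from the blog before trial-level detail is available.
record = ScoreRecord(model="example-model", score=0.5, sample_count=4)
print(record.missing_metadata())
```

Keeping the harness as an explicit field, even though it is currently the same for every row, means rows from a different harness could later be stored without being mistaken for comparable scores.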