SWE-bench Verified
A human-validated subset of SWE-bench (500 task instances) for evaluating systems that resolve real software engineering issues.
Category: Coding. Metric: % resolved (higher is better).
Run the verified SWE-bench split with a fixed agent scaffold, repository setup, and scoring harness. Keep the agent scaffold, model, tool access, and harness version attached to any reported score.
1. Install

    git clone https://github.com/SWE-bench/SWE-bench.git
    cd SWE-bench
    python -m venv .venv
    source .venv/bin/activate
    pip install -e .

2. Run evaluation

    python -m swebench.harness.run_evaluation --dataset_name SWE-bench/SWE-bench_Verified --split test --predictions_path ./predictions.jsonl --run_id eval-run

3. Score output

    python -m swebench.harness.report --run_id eval-run

4. Expected output
A per-instance report with resolved/unresolved status and an aggregate resolved percentage for this benchmark only.
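
The aggregate figure is simple arithmetic over the per-instance statuses. Here is a minimal sketch, assuming the statuses have already been loaded into a mapping from instance ID to a resolved flag. The function name, the mapping shape, and the instance IDs are illustrative and do not reflect the harness's actual report format.

```python
from typing import Dict, Optional


def resolved_percentage(statuses: Dict[str, bool],
                        total_instances: Optional[int] = None) -> float:
    """Return the percentage of instances marked resolved.

    statuses maps a (hypothetical) instance ID to True when resolved.
    total_instances, if given, is used as the denominator so that
    instances without a prediction count as unresolved.
    """
    total = len(statuses) if total_instances is None else total_instances
    resolved = sum(1 for ok in statuses.values() if ok)
    if total <= 0 or resolved > total:
        raise ValueError("total must be positive and at least the resolved count")
    return 100.0 * resolved / total


# Illustrative statuses only, not real benchmark results.
example = {
    "demo__repo-1": True,
    "demo__repo-2": False,
    "demo__repo-3": True,
    "demo__repo-4": False,
}
print(resolved_percentage(example))                      # 50.0
print(resolved_percentage(example, total_instances=8))   # 25.0
```

If the score is meant to cover the whole split, passing the split size as `total_instances` keeps skipped instances from inflating the percentage.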
5. Submit results
Follow the official SWE-bench submission instructions and include scaffold, tool access, split, and harness details.
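
One way to keep those details attached to a score is to store them in a single record next to the result. The sketch below is only an illustration: the `RunRecord` class, its field names, and all values are hypothetical placeholders, not a format required by the official submission process.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class RunRecord:
    """Hypothetical container for a score and the context it depends on."""
    benchmark: str
    split: str
    model: str
    scaffold: str
    tools: List[str]
    harness_version: str
    resolved_pct: float


def to_json(record: RunRecord) -> str:
    # Sorted keys keep the output stable across runs, which helps diffing.
    return json.dumps(asdict(record), indent=2, sort_keys=True)


# Placeholder values only; none of these describe a real run or result.
record = RunRecord(
    benchmark="SWE-bench Verified",
    split="test",
    model="example-model",
    scaffold="example-scaffold",
    tools=["shell", "file-edit"],
    harness_version="x.y.z",
    resolved_pct=0.0,
)
print(to_json(record))
```

Writing this record to the same directory as the per-instance report makes it harder for a score to circulate without its scaffold and harness details.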
Gotchas
Agent scaffold and tool access affect comparability.
Use the same split/version as the official leaderboard.
Patch formatting and environment setup are common failure points.
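
Because patch formatting is a common failure point, a quick check of the predictions file before a full run can save time. The sketch below assumes each line of `predictions.jsonl` is a JSON object with the keys `instance_id`, `model_name_or_path`, and `model_patch`, and that the patch is a unified diff. The `check_predictions` helper and its diff heuristic are illustrative assumptions, not part of the SWE-bench harness.

```python
import json
from typing import List

# Assumed prediction keys; verify against the harness version in use.
REQUIRED_KEYS = ("instance_id", "model_name_or_path", "model_patch")


def check_predictions(lines: List[str]) -> List[str]:
    """Return a list of human-readable problems found in JSONL lines."""
    problems = []
    seen = set()
    for n, raw in enumerate(lines, start=1):
        if not raw.strip():
            continue  # ignore blank lines
        try:
            row = json.loads(raw)
        except json.JSONDecodeError as exc:
            problems.append(f"line {n}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(row, dict):
            problems.append(f"line {n}: expected a JSON object")
            continue
        missing = [k for k in REQUIRED_KEYS if k not in row]
        if missing:
            problems.append(f"line {n}: missing keys {missing}")
            continue
        instance_id = str(row["instance_id"])
        if instance_id in seen:
            problems.append(f"line {n}: duplicate instance_id {instance_id!r}")
        seen.add(instance_id)
        patch = row["model_patch"]
        if not isinstance(patch, str) or not patch.strip():
            problems.append(f"line {n}: empty patch")
        elif "--- " not in patch or "+++ " not in patch:
            # Rough heuristic: a unified diff has ---/+++ file headers.
            problems.append(f"line {n}: patch does not look like a unified diff")
    return problems
```

A typical use would be `check_predictions(open("predictions.jsonl").read().splitlines())`, treating an empty result as a sign the file is at least structurally sound. It does not confirm that a patch applies cleanly or passes tests.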