evals.report

SWE-bench Verified

A curated SWE-bench split for evaluating systems that resolve real software engineering issues.

Category: Coding. Metric: % resolved (higher is better).

Run the verified SWE-bench split with a fixed agent scaffold, repository setup, and scoring harness. Keep the agent scaffold, model, tool access, and harness version attached to any reported score.

Benchmark: SWE-bench Verified
Dataset: huggingface.co/datasets/SWE-bench/SWE-bench_Verified
Metric: % resolved

1. Install

shell
git clone https://github.com/SWE-bench/SWE-bench.git
cd SWE-bench
python -m venv .venv
source .venv/bin/activate
pip install -e .

2. Run evaluation

shell
python -m swebench.harness.run_evaluation --dataset_name SWE-bench/SWE-bench_Verified --split test --predictions_path ./predictions.jsonl --run_id eval-run
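
The command above reads ./predictions.jsonl, one JSON object per line. A minimal sketch of producing that file follows, assuming the harness reads the keys instance_id, model_name_or_path, and model_patch from each record. The helper names write_predictions and load_predictions are hypothetical and not part of the swebench package.

```python
import json
from pathlib import Path

# Keys assumed to be read by the harness from each prediction record.
REQUIRED_KEYS = ("instance_id", "model_name_or_path", "model_patch")


def write_predictions(records, path):
    """Write one JSON object per line, refusing records with missing keys."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as fh:
        for record in records:
            missing = [key for key in REQUIRED_KEYS if key not in record]
            if missing:
                name = record.get("instance_id", "<unknown>")
                raise ValueError(f"{name}: missing keys {missing}")
            fh.write(json.dumps(record) + "\n")
    return path


def load_predictions(path):
    """Read a JSONL predictions file back into a list of dicts."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Call write_predictions(records, "./predictions.jsonl") with one dict per benchmark instance, where model_patch holds the unified diff your agent produced for that instance.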

3. Score output

shell
python -m swebench.harness.report --run_id eval-run

4. Expected output

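The run produces a per-instance report that marks each instance as resolved or unresolved, plus an aggregate resolved percentage. The percentage applies to this benchmark only and is not comparable with scores on other splits.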
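
The aggregate is simple arithmetic over the per-instance statuses. The sketch below shows one way to compute it, assuming the statuses are already loaded into a dict of instance id to boolean. The function name and input shape are illustrative and are not the harness's own report format. Passing the full split size as the denominator (500 instances for SWE-bench Verified) counts instances with no submitted prediction as unresolved.

```python
def resolved_percentage(status_by_instance, total_instances=None):
    """Return the percentage of resolved instances.

    status_by_instance maps instance id -> True if resolved.
    total_instances overrides the denominator, e.g. the full split size,
    so that instances with no prediction count as unresolved.
    """
    total = len(status_by_instance) if total_instances is None else total_instances
    if total <= 0:
        raise ValueError("total_instances must be positive")
    resolved = sum(1 for ok in status_by_instance.values() if ok)
    if resolved > total:
        raise ValueError("more resolved instances than total instances")
    return 100.0 * resolved / total


if __name__ == "__main__":
    statuses = {"a": True, "b": False, "c": True, "d": True}
    print(resolved_percentage(statuses))  # 75.0
```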

5. Submit results

Follow the official SWE-bench submission instructions and include scaffold, tool access, split, and harness details.
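
One way to keep those details attached to a score is a small metadata record saved next to the report. The sketch below is an assumption-laden example: the RunMetadata class and all of its field names are hypothetical, and this is not the official submission schema, which the SWE-bench submission instructions define.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunMetadata:
    """Hypothetical record of the settings behind one reported score."""
    benchmark: str
    split: str
    model: str
    scaffold: str
    harness_version: str
    tool_access: tuple = ()


def to_json(meta):
    """Serialize the record with stable key order for easy diffing."""
    return json.dumps(asdict(meta), indent=2, sort_keys=True)


if __name__ == "__main__":
    # All values below are placeholders, not real results or versions.
    meta = RunMetadata(
        benchmark="SWE-bench Verified",
        split="test",
        model="example-model",
        scaffold="example-scaffold",
        harness_version="<harness version>",
        tool_access=("shell", "file_edit"),
    )
    print(to_json(meta))
```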

Gotchas

Agent scaffold and tool access affect comparability.
Use the same split/version as the official leaderboard.
Patch formatting and environment setup are common failure points.
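
Many patch-formatting failures can be caught before the harness runs. The check below is a heuristic sketch only: it looks for the usual unified-diff markers and a trailing newline, and flags Markdown code fences left in by a model. The function name patch_problems is hypothetical, and a patch that passes can still fail to apply to the repository.

```python
def patch_problems(patch):
    """Return a list of likely formatting problems in a unified diff string."""
    if not patch or not patch.strip():
        return ["patch is empty"]

    problems = []
    lines = patch.splitlines()
    if not any(line.startswith("--- ") for line in lines):
        problems.append("no '--- ' old-file header")
    if not any(line.startswith("+++ ") for line in lines):
        problems.append("no '+++ ' new-file header")
    if not any(line.startswith("@@ ") for line in lines):
        problems.append("no '@@' hunk header")
    if not patch.endswith("\n"):
        problems.append("missing trailing newline")
    if patch.lstrip().startswith("```"):
        problems.append("wrapped in a Markdown code fence")
    return problems
```

Running this over every model_patch before evaluation separates formatting failures from genuine unresolved instances.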