What does % resolved mean on SWE-bench Verified?

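% resolved is the percentage of SWE-bench Verified instances that the harness grades as resolved after applying the submitted patch and running the repository's tests (resolved / total instances); higher is better. Scores are shown only within SWE-bench Verified and are never averaged with other benchmarks.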

What is the top reported SWE-bench Verified score?

Claude Fable 5 has the top reported score on SWE-bench Verified: 95.0% (% resolved).

Why do SWE-bench Verified scores differ across runs?

Harness, scaffold, reasoning effort, and prompt setup change results, so two runs of the same model can differ. evals.report keeps each score with its run context so the differences stay visible.

Does evals.report rank models across benchmarks?

No. SWE-bench Verified scores are shown within their own metric; evals.report never combines benchmarks into a composite ranking or a single "best model".


SWE-bench Verified

A curated SWE-bench split for evaluating systems that resolve real software engineering issues.

Category: Coding. Metric: % resolved. Higher is better.

Run this benchmark

SWE-bench Verified is run locally with the official `swebench` harness, which is Docker-based. You generate a predictions JSONL file from your own model or agent. Then `python -m swebench.harness.run_evaluation` builds per-instance Docker images, applies each `model_patch`, runs the repo test suite, grades resolution, and writes a report automatically. Scoring is part of the run command; there is no separate report CLI. Keep the agent scaffold, model, tool access, dataset split (Verified), and harness commit/version attached to any reported % resolved.

Benchmark: SWE-bench Verified

Repository: github.com/SWE-bench/SWE-bench

Dataset: huggingface.co/datasets/SWE-bench/SWE-bench_Verified

Metric: % resolved

1. Install

shell

git clone https://github.com/SWE-bench/SWE-bench.git
cd SWE-bench
python -m venv .venv
source .venv/bin/activate
pip install -e .

2. Run evaluation

shell

# Sanity-check the harness with gold patches first (requires Docker running):
python -m swebench.harness.run_evaluation --predictions_path gold --max_workers 1 --instance_ids sympy__sympy-20590 --run_id validate-gold

shell

# Then evaluate your own model. predictions.jsonl: one JSON object per line with fields instance_id, model_name_or_path, model_patch
python -m swebench.harness.run_evaluation --dataset_name SWE-bench/SWE-bench_Verified --predictions_path ./predictions.jsonl --max_workers 8 --run_id my-eval-run
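The predictions file described above could be produced with a few lines of Python. This is a minimal sketch, not part of the harness: the helper name `write_predictions` and the shape of its `patches` argument (a dict from instance ID to unified-diff string) are assumptions made for illustration. Only the three field names come from the text above.

```python
import json
from pathlib import Path


def write_predictions(path, model_name, patches):
    """Write one JSON object per line with the three fields named above.

    `patches` is a hypothetical input shape: {instance_id: unified_diff_string}.
    Returns the number of records written.
    """
    with Path(path).open("w", encoding="utf-8") as fh:
        for instance_id, patch in patches.items():
            record = {
                "instance_id": instance_id,
                "model_name_or_path": model_name,
                "model_patch": patch,
            }
            # json.dumps escapes newlines inside the diff, so each record
            # stays on a single line as JSONL requires.
            fh.write(json.dumps(record) + "\n")
    return len(patches)
```

Calling `write_predictions("predictions.jsonl", "my-model", patches)` would give a file suitable for `--predictions_path`, assuming each value in `patches` is a diff your agent actually produced.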

3. Expected output

The run writes per-instance evaluation logs, including a per-instance `report.json` and `test_output.txt`, under `logs/run_evaluation/<run_id>/<model_name_or_path>/<instance_id>/`. It also writes a final summary report JSON named `<model_name_or_path>.<run_id>.json`, with slashes in the model name replaced by `__`. At the end it prints counts: total instances, instances submitted, instances completed, instances resolved, instances unresolved, instances with empty patches, and instances with errors. The % resolved (resolved / total) is derived from these counts and is specific to SWE-bench Verified; do not combine it with SWE-bench Lite or full SWE-bench numbers.
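The arithmetic and the file naming described above can be sketched as follows. The function names are hypothetical, and the JSON keys `resolved_instances` and `total_instances` are an assumption about the summary report's schema; check them against a report from your own run before relying on them.

```python
import json
from pathlib import Path


def summary_report_name(model_name_or_path, run_id):
    """Build the summary file name, replacing slashes in the model name with '__'."""
    return f"{model_name_or_path.replace('/', '__')}.{run_id}.json"


def percent_resolved(resolved, total):
    """Return resolved / total as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * resolved / total


def load_percent_resolved(report_path):
    """Read a summary report and compute % resolved.

    The key names below are assumed, not confirmed by the text above.
    """
    report = json.loads(Path(report_path).read_text(encoding="utf-8"))
    return percent_resolved(report["resolved_instances"], report["total_instances"])
```

Dividing by the total instance count rather than by submitted or completed instances means that skipped, empty, and errored instances all count against the score.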

4. Submit results

To appear on the official leaderboard, follow the submission instructions at https://github.com/SWE-bench/experiments (open a PR with your predictions and per-instance logs). Always report the agent scaffold, underlying model, tool access, dataset split (Verified), and the harness commit/version used, since these materially affect the score.
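One way to keep that context attached to a score is to store both in a single record. The sketch below is illustrative only: `RunContext` and its field names are hypothetical and are not a format required by the leaderboard or the harness.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunContext:
    """Hypothetical record pairing a % resolved value with the run details listed above."""

    scaffold: str
    model: str
    tool_access: str
    dataset_split: str
    harness_version: str
    percent_resolved: float


def to_json(ctx):
    """Serialize the record with stable key order so runs are easy to diff."""
    return json.dumps(asdict(ctx), sort_keys=True)
```

Writing the output of `to_json` next to the summary report keeps the score from being quoted without the settings that produced it.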

Gotchas

The entrypoint module is `swebench.harness.run_evaluation` (NOT `swebench.run_evaluation`), and --run_id is a required argument (argparse required=True), as is --predictions_path.

There is no separate scoring/report CLI — `swebench.harness.report` does not exist (the harness package only contains reporting.py/grading.py with no __main__ CLI). The final summary report JSON and resolution counts are produced automatically by run_evaluation.

Docker is required and image builds are heavy (tens of GB). On ARM/macOS M-series append `--namespace ''` (or `--namespace none`) to build images locally instead of pulling the default x86 `swebench` namespace images.

predictions.jsonl must use exactly the fields instance_id, model_name_or_path, and model_patch (a unified-diff string); empty patches are counted as 'empty_patch' (unresolved), so a high empty-patch rate usually means your agent timed out before emitting a diff.
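Before spending hours on Docker builds, it can help to check the predictions file for the field and empty-patch issues above. This is a rough sketch with a hypothetical helper name, `check_predictions`. It treats a `model_patch` of `""` or `null` as empty, which is an assumption about how the harness classifies such records.

```python
import json

REQUIRED_FIELDS = {"instance_id", "model_name_or_path", "model_patch"}


def check_predictions(lines):
    """Count records, empty patches, and records whose fields do not match.

    `lines` is any iterable of JSONL strings, for example an open file.
    """
    total = 0
    empty = 0
    problems = []  # (line number, sorted list of missing or unexpected fields)
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        total += 1
        record = json.loads(line)
        if set(record) != REQUIRED_FIELDS:
            problems.append((lineno, sorted(set(record) ^ REQUIRED_FIELDS)))
            continue
        if record["model_patch"] in ("", None):
            empty += 1
    rate = empty / total if total else 0.0
    return {"total": total, "empty": empty, "empty_rate": rate, "problems": problems}
```

For example, `check_predictions(open("predictions.jsonl", encoding="utf-8"))` returns a dict whose `empty_rate` can be compared against whatever threshold you consider acceptable for your agent.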