What does % resolved mean on SWE-bench Pro?

On SWE-bench Pro, % resolved is the percentage of benchmark instances that a model's patch resolves; higher is better. Scores are shown only within SWE-bench Pro and are never averaged with other benchmarks.

What is the top reported SWE-bench Pro score?

Claude Fable 5 has the top reported score on SWE-bench Pro: 80.0% (% resolved).

Why do SWE-bench Pro scores differ across runs?

Harness, scaffold, reasoning effort, and prompt setup change results, so two runs of the same model can differ. evals.report keeps each score with its run context so the differences stay visible.

Does evals.report rank models across benchmarks?

No. SWE-bench Pro scores are shown within their own metric; evals.report never combines benchmarks into a composite ranking or a single "best model".


SWE-bench Pro

A harder public software-engineering agent benchmark built around professional repository tasks.

Coding · % resolved · Higher is better

Run this benchmark

SWE-bench Pro is run via Scale AI's official open-source harness (scaleapi/SWE-bench_Pro-os) against the public ScaleAI/SWE-bench_Pro dataset (a single 731-instance test split). You generate patches with your own model or agent (the repo ships SWE-agent and mini-swe-agent as submodules, which produce .pred files), collate them into a JSON file with helper_code/gather_patches.py, then grade them in per-instance prebuilt Docker images via swe_bench_pro_eval.py (Modal by default, or local Docker with --use_local_docker). Attach the following to any score you report: the harness commit, the scaffold used to generate patches (SWE-agent, mini-swe-agent, or your own agent), whether you ran on Modal or local Docker, and the fact that it was the public test split.

Benchmark: SWE-bench Pro

Repository: github.com/scaleapi/SWE-bench_Pro-os

Dataset: huggingface.co/datasets/ScaleAI/SWE-bench_Pro

Metric: % resolved

1. Install

shell

git clone --recurse-submodules https://github.com/scaleapi/SWE-bench_Pro-os.git
cd SWE-bench_Pro-os
pip install -r requirements.txt

shell

# Install Docker (see https://docs.docker.com/engine/install/)
# Authenticate with Modal: follow the prompts to generate your token, then verify
# token_id/token_secret/active=true in ~/.modal.toml.
# Alternatively, skip this step and use --use_local_docker (beta) with a local Docker install.
modal setup

2. Run evaluation

shell

# Step 1: Generate patches with your model. Follow the instructions in the bundled
# SWE-agent (or mini-swe-agent) git submodule to run your model on the SWE-bench Pro
# instances; this produces a .pred file per instance (e.g. under swe_bench_pro_results/sample1/).
#
# Step 2: Collate the per-instance .pred files into a single predictions JSON.
python helper_code/gather_patches.py \
  --directory swe_bench_pro_results/sample1 \
  --prefix sample1 \
  --output sample1_patches.json

3. Score output

shell

# Score predictions against the gold tests in per-instance Docker images
# (Modal by default; add --use_local_docker for local Docker).
python swe_bench_pro_eval.py \
  --raw_sample_path=swe_bench_pro_full.csv \
  --patch_path=sample1_patches.json \
  --output_dir=eval_output \
  --scripts_dir=run_scripts \
  --num_workers=100 \
  --dockerhub_username=jefzda

shell

# Optional sanity check: extract the dataset's gold patches and confirm they resolve
python helper_code/extract_gold_patches.py --output gold_patches.json

4. Expected output

swe_bench_pro_eval.py runs each patch in a Modal (or local Docker) sandbox using the per-instance Docker Hub images, executes the run_scripts tests, and calculates overall accuracy from the tests' pass/fail status. An instance counts as resolved only if its fail_to_pass tests now pass and its pass_to_pass tests still pass. The headline metric is the Resolve Rate (the percentage of instances resolved); results are written per instance to --output_dir. Report the Resolve Rate only against the SWE-bench Pro public test split (731 instances), and do not compare it to SWE-bench or SWE-bench Verified resolve rates, which use a different instance set.
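The resolve criterion above can be sketched in a few lines of Python. This is an illustrative sketch, not the harness's own code: the function names are hypothetical, and it assumes each instance's test outcomes are already available as a set of passing test names.

```python
from typing import Iterable, List, Set


def is_resolved(passed: Set[str],
                fail_to_pass: Iterable[str],
                pass_to_pass: Iterable[str]) -> bool:
    """Resolved only if every fail_to_pass test now passes
    and every pass_to_pass test still passes."""
    return (all(test in passed for test in fail_to_pass)
            and all(test in passed for test in pass_to_pass))


def resolve_rate(resolved_flags: Iterable[bool]) -> float:
    """Percentage of instances resolved (one flag per scored instance)."""
    flags: List[bool] = list(resolved_flags)
    if not flags:
        raise ValueError("no instances to score")
    return 100.0 * sum(flags) / len(flags)


# Two toy instances: the first passes both groups of tests, the second
# fixes the target test but breaks a previously passing one.
flags = [
    is_resolved({"test_fix", "test_existing"}, ["test_fix"], ["test_existing"]),
    is_resolved({"test_fix"}, ["test_fix"], ["test_existing"]),
]
print(resolve_rate(flags))  # 50.0
```

The second toy instance shows why pass_to_pass matters: a patch that fixes the target test but breaks an existing one does not count as resolved.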

5. Submit results

There is no automated submission endpoint in the repo. The public leaderboard is at scale.com/leaderboard/swe_bench_pro_public, where Scale AI curates entries; there is also a private commercial leaderboard at labs.scale.com/leaderboard/swe_bench_pro_private. When reporting a self-run score, attach: the harness commit of scaleapi/SWE-bench_Pro-os; the exact scaffold used to generate patches (the bundled SWE-agent, mini-swe-agent, or a custom agent), since the metric is scaffold-dependent; whether evaluation ran on Modal or local Docker; the dataset split (public test, 731 instances); and the Docker Hub image source (jefzda/sweap-images). Run the gold-patch path first to confirm that your environment scores known-good patches correctly.
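One way to keep that context attached to a self-run score is to store it in a small record next to the number. The sketch below is only a suggestion: the RunContext class, its field names, and the JSON layout are hypothetical, not a format that the harness or the leaderboard defines.

```python
import json
from dataclasses import asdict, dataclass

ALLOWED_BACKENDS = ("modal", "local_docker")


@dataclass(frozen=True)
class RunContext:
    """Hypothetical record of the context to report with a score."""
    harness_commit: str         # commit of scaleapi/SWE-bench_Pro-os
    scaffold: str               # e.g. "SWE-agent", "mini-swe-agent", "custom"
    execution_backend: str      # "modal" or "local_docker"
    dataset_split: str          # e.g. "public test"
    num_instances: int          # 731 for the public test split
    image_source: str           # e.g. "jefzda/sweap-images"
    resolve_rate_percent: float

    def __post_init__(self) -> None:
        if self.execution_backend not in ALLOWED_BACKENDS:
            raise ValueError(f"unknown backend: {self.execution_backend!r}")
        if not 0.0 <= self.resolve_rate_percent <= 100.0:
            raise ValueError("resolve_rate_percent must be within 0..100")


def to_json(ctx: RunContext) -> str:
    """Serialize the record so it can be stored alongside the score."""
    return json.dumps(asdict(ctx), indent=2, sort_keys=True)


example = RunContext(
    harness_commit="<harness commit sha>",  # placeholder, fill in your own
    scaffold="mini-swe-agent",
    execution_backend="local_docker",
    dataset_split="public test",
    num_instances=731,
    image_source="jefzda/sweap-images",
    resolve_rate_percent=0.0,               # placeholder, not a real result
)
print(to_json(example))
```

Validating the backend and the score range when the record is created catches a malformed record before it is published.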

Gotchas

Patch generation and grading are separate: the score depends heavily on the scaffold/agent you use to produce the .pred patches (the README notes SWE-agent reproduces the Sonnet 4.5 results, and mini-swe-agent is comparable), not just the model. Always report the scaffold alongside the number.

Evaluation requires the prebuilt per-instance Docker images at jefzda/sweap-images (selected via each row's dockerhub_tag column); pulling images for 731 instances is bandwidth/disk heavy. Default execution uses Modal (run 'modal setup' and verify token_id/token_secret/active=true in ~/.modal.toml); --use_local_docker is a beta alternative with no extra setup.

--raw_sample_path expects a CSV (the README example uses swe_bench_pro_full.csv, which is NOT checked into the repo) with specific columns the script reads: instance_id, before_repo_set_cmd, selected_test_files_to_run, base_commit, base_dockerfile, instance_dockerfile, FAIL_TO_PASS, PASS_TO_PASS. Derive this CSV from the public ScaleAI/SWE-bench_Pro HF dataset; a plain dataset dump is not sufficient unless it contains these columns. (Note: the script docstring's internal example name 'sweap_pro_eval_modal.py' is an artifact — the actual file to run is swe_bench_pro_eval.py at the repo root, which handles both Modal and local Docker.)

Clone with --recurse-submodules (the repo pins SWE-agent and mini-swe-agent submodules); a plain clone leaves the patch-generation step empty. The eval script also expects run_scripts/{instance_id}/run_script.sh and parser.py per instance. num_workers=100 in the example is aggressive — lower it on constrained machines.
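The CSV-column and run_scripts requirements above lend themselves to a quick pre-flight check before a long evaluation starts. The sketch below uses only the Python standard library. Its function names are hypothetical and it is not part of the harness; the column and file names are taken from the gotchas above.

```python
import csv
from pathlib import Path
from typing import Dict, Iterable, List

# Columns the eval script reads from the --raw_sample_path CSV.
REQUIRED_COLUMNS = (
    "instance_id",
    "before_repo_set_cmd",
    "selected_test_files_to_run",
    "base_commit",
    "base_dockerfile",
    "instance_dockerfile",
    "FAIL_TO_PASS",
    "PASS_TO_PASS",
)

# Files expected under run_scripts/{instance_id}/.
REQUIRED_SCRIPT_FILES = ("run_script.sh", "parser.py")


def missing_csv_columns(csv_path: str) -> List[str]:
    """Return the required columns absent from the CSV header row."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), [])
    return [col for col in REQUIRED_COLUMNS if col not in header]


def missing_script_files(scripts_dir: str,
                         instance_ids: Iterable[str]) -> Dict[str, List[str]]:
    """Map each instance_id to the per-instance files it lacks.

    Instances with nothing missing are left out of the result.
    """
    root = Path(scripts_dir)
    problems: Dict[str, List[str]] = {}
    for instance_id in instance_ids:
        missing = [name for name in REQUIRED_SCRIPT_FILES
                   if not (root / instance_id / name).is_file()]
        if missing:
            problems[instance_id] = missing
    return problems
```

For example, calling missing_csv_columns("swe_bench_pro_full.csv") and missing_script_files("run_scripts", instance_ids) with the instance IDs from your CSV should both return empty results before you launch swe_bench_pro_eval.py.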