Run SWE-fficiency
The same run guide is also available from the benchmark detail page.
SWE-fficiency is a 498-task repository-level performance-optimization benchmark over 9 Python libraries (numpy, scipy, pandas, scikit-learn, matplotlib, xarray, sympy, dask, astropy). You install the official `swefficiency` package, generate patches for your model via the `scripts/inference/custom.py` harness (which pulls the dataset from Hugging Face and runs your agent inside prebuilt per-instance Docker images), run a gold baseline plus your predictions through `swefficiency eval`, then aggregate with `swefficiency report`. Attach the following to any reported score: the harness commit (branch swefficiency_base), the dataset split (test, 498 instances), the agent scaffold/spec used, the per-worker resources (4 vCPUs / 16 GB RAM), and a note that the headline metric is the harmonic-mean Speedup Ratio (overall_score).
1. Install
git clone https://github.com/swefficiency/swefficiency.git
cd swefficiency
uv venv --python 3.12
source .venv/bin/activate
uv sync
# Alternatively: pip install -e .
# VM/Docker setup for reproducible CPU/memory pinning (Linux host; matches paper's GCP n2-standard-64):
bash scripts/vm/setup_vm.sh
sudo scripts/vm/setup_docker.sh MEM_MAX MEM_HIGH
2. Run evaluation
# Generate model patches with the inference harness (loads dataset from HF, runs your agent in the prebuilt Docker image, writes git patches):
python scripts/inference/custom.py \
--run-id my_run \
--spec scripts/inference/specs/cursor_cli.yaml \
--num-workers 4 \
--instance-ids numpy__numpy-18065 pandas-dev__pandas-28447 \
--var cursor_cli_args="--max-steps 75"
# Patches land under logs/run_inference/<run_id>/<spec_name>/<instance_id>/patch.diff
# Convert the per-instance patch.diff files into a predictions.jsonl with one record per line:
# {"instance_id": "<id>", "model_patch": "<patch_text>", "model_name_or_path": "<model_name>"}
3. Score output
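Scoring expects a predictions.jsonl file, so the per-instance patch.diff files from the inference step have to be collected first. The following is a minimal sketch of that conversion, not a script shipped with the harness: the function name `build_predictions` is hypothetical, it assumes each instance directory sits directly under the spec directory as shown in the path layout above, and it assumes that instances with an empty patch should be left out of the file.

```python
import json
from pathlib import Path


def build_predictions(spec_dir: Path, model_name: str, out_path: Path) -> int:
    """Collect <instance_id>/patch.diff files under spec_dir into a JSONL file.

    Returns the number of records written. This helper is a hypothetical
    illustration, not part of the swefficiency package.
    """
    written = 0
    with out_path.open("w", encoding="utf-8") as out:
        # Each instance directory is named after its instance_id.
        for patch_file in sorted(spec_dir.glob("*/patch.diff")):
            patch_text = patch_file.read_text(encoding="utf-8")
            if not patch_text.strip():
                # Assumption: an empty patch means the agent produced nothing,
                # so the instance is skipped rather than submitted.
                continue
            record = {
                "instance_id": patch_file.parent.name,
                "model_patch": patch_text,
                "model_name_or_path": model_name,
            }
            out.write(json.dumps(record) + "\n")
            written += 1
    return written
```

A call such as `build_predictions(Path("logs/run_inference/my_run/<spec_name>"), "<model_name>", Path("predictions.jsonl"))` would then produce the file passed to `--prediction_path`, with the placeholders replaced by your own spec directory and model name.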
# Step 1: establish gold (expert) baseline performance (no --prediction_path):
swefficiency eval --run_id my_eval --num_workers 12
# Step 2: evaluate your model predictions (correctness tests + workload timing):
swefficiency eval --run_id my_eval --num_workers 12 --prediction_path predictions.jsonl
# Step 3: aggregate into the final report (CSV + JSON):
swefficiency report \
--gold_run logs/run_evaluation/my_eval/gold \
--pred_run logs/run_evaluation/my_eval/<model_name>
4. Expected output
swefficiency eval writes per-instance logs and raw runtime measurements (perf_summary.txt, covering_test_status.json) under logs/run_evaluation/my_eval/gold/ and logs/run_evaluation/my_eval/<model_name>/. swefficiency report writes two files into eval_reports/: eval_report_<model_name>.csv (per-instance results) and eval_report_<model_name>.json with summary metrics including overall_score (harmonic mean of Speedup Ratios = the headline 'speedup score'), proportion_incorrect, proportion_correct_but_no_speedup, proportion_correct_with_speedup_but_human_no_speedup, and proportion_human_speedup_or_better. SR is (model speedup)/(expert speedup); SR>1.0 means the model beat the human PR. Do not compare against numbers produced under different resource pinning, dataset splits, or agent scaffolds.
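To make the headline metric concrete, here is a small worked sketch of how a Speedup Ratio and its harmonic mean could be computed by hand. It is an illustration only, not the harness's implementation: the runtimes are invented, the helper names are hypothetical, speedup is assumed to mean unpatched runtime divided by patched runtime, and the way the real report treats incorrect patches is not modeled.

```python
from statistics import harmonic_mean


def speedup(baseline_seconds: float, patched_seconds: float) -> float:
    """Assumed definition: how many times faster the workload runs after a patch."""
    return baseline_seconds / patched_seconds


def speedup_ratio(baseline_seconds: float, expert_seconds: float, model_seconds: float) -> float:
    """SR = (model speedup) / (expert speedup); SR > 1.0 means the model beat the expert."""
    return speedup(baseline_seconds, model_seconds) / speedup(baseline_seconds, expert_seconds)


# Invented runtimes in seconds: (unpatched, expert patch, model patch).
runtimes = [
    (10.0, 5.0, 4.0),  # model is faster than the expert
    (8.0, 2.0, 4.0),   # model is slower than the expert
]

ratios = [speedup_ratio(*row) for row in runtimes]
overall = harmonic_mean(ratios)

print(ratios)             # [1.25, 0.5]
print(round(overall, 4))  # 0.7143
```

Under this definition the unpatched runtime cancels, so SR reduces to expert runtime divided by model runtime. The harmonic mean weights low ratios heavily, which is why one instance at 0.5 pulls the aggregate well below the arithmetic mean of 0.875.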
5. Submit results
There is no auto-submit CLI in the harness; report the overall_score (harmonic-mean Speedup Ratio) from eval_report_<model_name>.json. Always attach the run context to any number: the harness commit/branch (swefficiency_base), the dataset (swefficiency/swefficiency, split 'test', 498 instances) and whether you ran the full set or a subset via --instance-ids/--instance-regex, the agent scaffold/spec YAML used for generation, the per-task limits (the paper uses 3h wall-clock and 100 max actions), and the per-worker resources (4 vCPUs / 16 GB RAM, with --num_workers 12 recommended on n2-standard-64). For leaderboard reference, see the project homepage (https://swefficiency.com) and the paper (https://arxiv.org/abs/2511.06090).
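One way to keep the run context attached to a score is to store both in a single record. The sketch below is a hypothetical convenience, not a submission format defined by the project: the `RunContext` field names and `build_result_record` are invented for illustration, the defaults mirror the values listed above, and it assumes `overall_score` is a top-level key in the report JSON.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RunContext:
    """Hypothetical container for the context that should accompany a score."""
    harness_commit: str                      # fill in the commit you actually ran
    harness_branch: str = "swefficiency_base"
    dataset: str = "swefficiency/swefficiency"
    split: str = "test"
    instances_evaluated: int = 498           # lower this if you ran a subset
    agent_spec: str = "scripts/inference/specs/cursor_cli.yaml"
    wall_clock_limit_hours: float = 3.0      # per-task limit used in the paper
    max_actions: int = 100                   # per-task limit used in the paper
    vcpus_per_worker: int = 4
    ram_gb_per_worker: int = 16
    num_workers: int = 12


def build_result_record(report: dict, context: RunContext) -> dict:
    """Pair the headline score with the context it was produced under."""
    # Assumption: the report JSON exposes overall_score at the top level.
    return {
        "overall_score": report["overall_score"],
        "run_context": asdict(context),
    }
```

In practice `report` would come from `json.load` on eval_reports/eval_report_<model_name>.json, and the result of `json.dumps(build_result_record(report, RunContext(harness_commit="<commit>")), indent=2)` can be pasted next to any number you share.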