evals.report

SWE-fficiency is a repository-level performance-optimization benchmark of 498 tasks drawn from 9 Python libraries (numpy, scipy, pandas, scikit-learn, matplotlib, xarray, sympy, dask, astropy). The workflow has four steps: install the official `swefficiency` package; generate patches for your model with the `scripts/inference/custom.py` harness (which pulls the dataset from Hugging Face and runs your agent inside prebuilt per-instance Docker images); run a gold baseline and then your predictions through `swefficiency eval`; and aggregate with `swefficiency report`. Report every score together with its run context: the harness commit (branch swefficiency_base), the dataset split (test, 498 instances), the agent scaffold/spec used, and the per-worker resources (4 vCPUs / 16 GB RAM). The headline metric is the harmonic-mean Speedup Ratio (overall_score).

Benchmark: SWE-fficiency
Dataset: huggingface.co/datasets/swefficiency/swefficiency
Metric: speedup score

1. Install

shell
git clone https://github.com/swefficiency/swefficiency.git
cd swefficiency
uv venv --python 3.12
source .venv/bin/activate
uv sync
# Alternatively: pip install -e .
shell
# VM/Docker setup for reproducible CPU/memory pinning (Linux host; matches the paper's GCP n2-standard-64):
bash scripts/vm/setup_vm.sh
sudo scripts/vm/setup_docker.sh MEM_MAX MEM_HIGH

2. Run evaluation

shell
# Generate model patches with the inference harness (loads dataset from HF, runs your agent in the prebuilt Docker image, writes git patches):
python scripts/inference/custom.py \
  --run-id my_run \
  --spec scripts/inference/specs/cursor_cli.yaml \
  --num-workers 4 \
  --instance-ids numpy__numpy-18065 pandas-dev__pandas-28447 \
  --var cursor_cli_args="--max-steps 75"
shell
# Patches land under logs/run_inference/<run_id>/<spec_name>/<instance_id>/patch.diff
# Convert the per-instance patch.diff files into a predictions.jsonl with one record per line:
#   {"instance_id": "<id>", "model_patch": "<patch_text>", "model_name_or_path": "<model_name>"}
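The conversion step described in the comments above can be sketched in Python. This is a minimal sketch, not part of the harness: the function name `collect_predictions` is hypothetical, and it assumes that every immediate subdirectory of the spec directory is named after its instance ID and holds a `patch.diff`, as in the path layout shown above.

```python
import json
from pathlib import Path


def collect_predictions(spec_dir, model_name, out_path):
    """Write one JSON record per instance that has a patch.diff; return the count.

    Hypothetical helper: assumes the layout <spec_dir>/<instance_id>/patch.diff.
    """
    spec_dir = Path(spec_dir)
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        # Sort for a stable, reproducible record order.
        for patch_file in sorted(spec_dir.glob("*/patch.diff")):
            record = {
                # The parent directory name is taken as the instance ID.
                "instance_id": patch_file.parent.name,
                "model_patch": patch_file.read_text(encoding="utf-8"),
                "model_name_or_path": model_name,
            }
            out.write(json.dumps(record) + "\n")
            count += 1
    return count
```

A call such as `collect_predictions("logs/run_inference/my_run/cursor_cli", "my_model", "predictions.jsonl")` would then produce the file passed to `--prediction_path`. The spec directory name `cursor_cli` and the model name `my_model` are placeholders here; check the actual directory names your run produced.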

3. Score output

shell
# Step 1: establish gold (expert) baseline performance (no --prediction_path):
swefficiency eval --run_id my_eval --num_workers 12
shell
# Step 2: evaluate your model predictions (correctness tests + workload timing):
swefficiency eval --run_id my_eval --num_workers 12 --prediction_path predictions.jsonl
shell
# Step 3: aggregate into the final report (CSV + JSON):
swefficiency report \
    --gold_run logs/run_evaluation/my_eval/gold \
    --pred_run logs/run_evaluation/my_eval/<model_name>

4. Expected output

`swefficiency eval` writes per-instance logs and raw runtime measurements (perf_summary.txt, covering_test_status.json) under logs/run_evaluation/my_eval/gold/ and logs/run_evaluation/my_eval/<model_name>/. `swefficiency report` writes two files into eval_reports/: eval_report_<model_name>.csv, which holds per-instance results, and eval_report_<model_name>.json, which holds the summary metrics: overall_score (the harmonic mean of Speedup Ratios, i.e. the headline 'speedup score'), proportion_incorrect, proportion_correct_but_no_speedup, proportion_correct_with_speedup_but_human_no_speedup, and proportion_human_speedup_or_better. The Speedup Ratio (SR) of an instance is (model speedup)/(expert speedup); SR > 1.0 means the model beat the human PR. Do not compare your score against numbers produced under different resource pinning, dataset splits, or agent scaffolds.
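To make the headline metric concrete, the following sketch computes per-instance Speedup Ratios and their harmonic mean on made-up numbers. It illustrates the formula only; the function names are hypothetical, and the harness's own aggregation may treat incorrect or missing patches differently.

```python
def speedup_ratio(model_speedup, expert_speedup):
    """SR = (model speedup) / (expert speedup) for one instance."""
    return model_speedup / expert_speedup


def harmonic_mean(values):
    """Harmonic mean n / sum(1/v); requires strictly positive values."""
    values = list(values)
    if not values:
        raise ValueError("need at least one value")
    if any(v <= 0 for v in values):
        raise ValueError("all values must be positive")
    return len(values) / sum(1.0 / v for v in values)


# Made-up (model speedup, expert speedup) pairs, for illustration only.
pairs = [(2.0, 2.0), (1.0, 4.0), (3.0, 1.5)]
ratios = [speedup_ratio(m, e) for m, e in pairs]
score = harmonic_mean(ratios)
print(ratios, round(score, 4))
```

The three ratios are 1.0, 0.25, and 2.0, and their harmonic mean is 3 / (1 + 4 + 0.5), roughly 0.545. The harmonic mean is pulled strongly toward the smallest ratio, so one instance where the model falls far short of the expert outweighs one where it doubles the expert's speedup.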

5. Submit results

The harness has no auto-submit CLI; report the overall_score (harmonic-mean Speedup Ratio) from eval_report_<model_name>.json. Attach the run context to every number you report: the harness commit/branch (swefficiency_base); the dataset (swefficiency/swefficiency, split 'test', 498 instances) and whether you ran the full set or a subset via --instance-ids/--instance-regex; the agent scaffold/spec YAML used for generation; the per-task limits (the paper uses 3h wall-clock and 100 max actions); and the per-worker resources (4 vCPUs / 16 GB RAM, with --num_workers 12 recommended on n2-standard-64). For leaderboard reference, see the project homepage (https://swefficiency.com) and the paper (https://arxiv.org/abs/2511.06090).
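One way to keep the score and its run context together is to store them in a single record. The sketch below assumes only that the report JSON has a top-level overall_score key, as described above; the helper name `build_run_record` and every context key are hypothetical choices, not a format the harness defines.

```python
import json
from pathlib import Path


def build_run_record(report_path, context):
    """Combine overall_score from the report JSON with a run-context dict."""
    report = json.loads(Path(report_path).read_text(encoding="utf-8"))
    # Context keys are merged after the score so the score stays first in the output.
    return {"overall_score": report["overall_score"], **context}


# Hypothetical context keys; the values restate the settings named in this guide.
example_context = {
    "harness_branch": "swefficiency_base",
    "harness_commit": "<commit sha>",  # fill in from `git rev-parse HEAD`
    "dataset": "swefficiency/swefficiency",
    "split": "test",
    "instances": "full (498)",
    "spec": "scripts/inference/specs/cursor_cli.yaml",
    "per_worker_resources": "4 vCPUs / 16 GB RAM",
    "num_workers": 12,
}
```

Calling `build_run_record("eval_reports/eval_report_<model_name>.json", example_context)` with your real report path and dumping the result with `json.dumps(..., indent=2)` gives a block that can be pasted next to any published number.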

Gotchas

Two eval passes are mandatory: you MUST run the gold baseline (swefficiency eval --run_id my_eval --num_workers 12 with no --prediction_path) before running predictions, because SR is normalized against expert speedup measured on the same machine. swefficiency report errors out if logs/run_evaluation/my_eval/gold is missing, and it needs both the gold and the <model_name> run directories.
Speedup measurements are hardware-sensitive: the harness pins CPU/memory (4 vCPUs, 16 GB RAM per worker) via scripts/vm/setup_docker.sh, which requires sudo and a Linux Docker host. Skipping this or running on different hardware makes SR numbers non-comparable to the leaderboard.
The README (README.md) refers to a 'cursor.py' inference harness, but the actual entrypoint is scripts/inference/custom.py. That script uses dash-style flags (--run-id, --num-workers, --instance-ids, --spec, --var), whereas the `swefficiency eval` and `swefficiency report` commands use underscore flags (--run_id, --num_workers, --prediction_path, --gold_run, --pred_run). Do not mix the two flag styles.
The repo's default branch is swefficiency_base (not main/master).
Inference requires the prebuilt Docker images (docker pull swefficiency/swefficiency_images:<id>). Pulling is on by default; use --no-pull only if the images already exist locally.
A failed or non-applying patch is scored as model speedup 1.0 (no improvement) rather than raising a hard error.
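Because the report step needs both run directories, a quick pre-flight check before calling `swefficiency report` can save a failed aggregation. This is a minimal sketch with a hypothetical helper name; it assumes the directory layout described above (a gold directory and a <model_name> directory under the same run-ID folder).

```python
from pathlib import Path


def missing_run_dirs(eval_root, model_name):
    """Return the required run directories that do not exist under eval_root."""
    root = Path(eval_root)
    # Both the gold baseline and the model run are needed for the report.
    required = [root / "gold", root / model_name]
    return [str(p) for p in required if not p.is_dir()]
```

For example, `missing_run_dirs("logs/run_evaluation/my_eval", "my_model")` returns an empty list when both passes have been run, and otherwise lists the directories still to be produced.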