
SWE-bench Pro

A harder public software-engineering agent benchmark built around professional repository tasks.

Category: Coding. Metric: % resolved (higher is better).

Repo includes harness scripts, Dockerfiles, and run scripts.

Benchmark: SWE-bench Pro
Dataset: Not provided
Metric: % resolved
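
As a worked illustration of the metric, the sketch below assumes that % resolved is the share of attempted task instances marked as resolved, scaled to 0-100. The function name `percent_resolved` and the counts in the usage line are hypothetical and are not taken from the benchmark's harness.

```python
def percent_resolved(resolved: int, total: int) -> float:
    """Return resolved tasks as a percentage of all tasks (assumed definition)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= resolved <= total:
        raise ValueError("resolved must be between 0 and total")
    return 100.0 * resolved / total


# Hypothetical counts, for illustration only.
print(percent_resolved(1, 4))  # 25.0
```

Rejecting a non-positive total makes an empty or partial run fail loudly instead of reporting a misleading 0%.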

1. Expected output

Check the official source links for the current output format, submission steps, and benchmark-specific result files.

2. Submit results

Keep source URL, source model name, benchmark version, harness, and run context attached to any reported score.
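
One way to keep those items attached to a score is to store them in a single record and check that none is blank before reporting. This is a minimal sketch, not an official submission format: the class `ScoreRecord`, its field names, and all values in the usage example are hypothetical placeholders.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ScoreRecord:
    # Field names are hypothetical; they mirror the items listed above.
    score_percent_resolved: float
    source_url: str
    source_model_name: str
    benchmark_version: str
    harness: str
    run_context: str

    def missing_fields(self) -> list[str]:
        """Return the names of text fields that were left blank."""
        return [
            name
            for name, value in asdict(self).items()
            if isinstance(value, str) and not value.strip()
        ]


# Placeholder values only; benchmark_version is deliberately left blank.
record = ScoreRecord(
    score_percent_resolved=0.0,
    source_url="https://example.invalid/run",
    source_model_name="placeholder-model",
    benchmark_version="",
    harness="placeholder-harness",
    run_context="placeholder context",
)
print(record.missing_fields())  # ['benchmark_version']
```

Freezing the dataclass keeps the score and its provenance from drifting apart after the record is created.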

Gotchas

Record the maximum number of turns and the agent configuration for every run, because results depend on the scaffold.
Do not compare or combine this benchmark's % resolved with metrics from unrelated benchmarks.
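
Both gotchas can be enforced with a simple guard that refuses to compare two runs unless they share the same metric and scaffold settings. The sketch below is an assumption about how one might do this. The function `comparable`, the dictionary keys, and the example values are hypothetical and do not come from the benchmark's tooling.

```python
def comparable(run_a: dict, run_b: dict,
               keys=("metric", "max_turns", "agent_config")) -> bool:
    """True only if both runs define every key and agree on its value."""
    return all(
        k in run_a and k in run_b and run_a[k] == run_b[k]
        for k in keys
    )


# Hypothetical run descriptions that differ only in max_turns.
run_a = {"metric": "% resolved", "max_turns": 50, "agent_config": "config-a"}
run_b = {"metric": "% resolved", "max_turns": 100, "agent_config": "config-a"}

print(comparable(run_a, run_a))  # True
print(comparable(run_a, run_b))  # False
```

Treating a missing key as a mismatch means a run with untracked settings is never silently compared with a documented one.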