evals.report

The official guide documents Pier/Harbor-compatible execution with mini-swe-agent, running subsets and single tasks, and submitting results.

Benchmark: DeepSWE
Dataset: deepswe.datacurve.ai/data
Metric: % resolved

1. Expected output

Use the official source links for current output format, submission steps, and benchmark-specific result files.

2. Submit results

Keep source URL, source model name, benchmark version, harness, and run context attached to any reported score.
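One way to keep that provenance attached is to store each score as a single record and refuse to report it when a field is empty. The sketch below is illustrative only: the class name `ReportedScore`, the helper `missing_provenance`, and all field names are hypothetical and are not part of any official DeepSWE or mini-swe-agent schema. The values in the usage lines are placeholders, not real results.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class ReportedScore:
    """A score plus the provenance that should travel with it (hypothetical schema)."""

    benchmark: str
    metric: str
    value: float
    source_url: str
    source_model_name: str
    benchmark_version: str
    harness: str
    # Free-form run context, e.g. date of the run or subset used (assumed keys).
    run_context: dict = field(default_factory=dict)

    def to_json(self) -> str:
        """Serialize the full record so the score is never stored on its own."""
        return json.dumps(asdict(self), sort_keys=True)


def missing_provenance(score: ReportedScore) -> list:
    """Return the names of required provenance fields that are empty."""
    required = ("source_url", "source_model_name", "benchmark_version", "harness")
    return [name for name in required if not getattr(score, name)]


if __name__ == "__main__":
    # Placeholder values only; replace with the details of an actual run.
    score = ReportedScore(
        benchmark="DeepSWE",
        metric="% resolved",
        value=0.0,
        source_url="",
        source_model_name="placeholder-model",
        benchmark_version="placeholder-version",
        harness="mini-swe-agent",
    )
    print(missing_provenance(score))
    print(score.to_json())
```

Serializing the whole record, rather than the bare number, makes it harder for a score to be copied into a table without its source and harness.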

Gotchas

All leaderboard scores use mini-swe-agent; store harness, reasoning effort, sample count, confidence interval, and cost metadata.
Do not mix this benchmark's metric with unrelated benchmark metrics.
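As a rough illustration of the sample count and confidence interval mentioned above, the sketch below computes % resolved and a 95% Wilson score interval from a resolved/total count. The choice of the Wilson interval is an assumption made here for illustration; the source does not say how the leaderboard computes its intervals. The function names are hypothetical, and each task is treated as one independent pass/fail trial.

```python
import math


def percent_resolved(resolved: int, total: int) -> float:
    """Share of tasks resolved, expressed as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= resolved <= total:
        raise ValueError("resolved must be between 0 and total")
    return 100.0 * resolved / total


def wilson_interval(resolved: int, total: int, z: float = 1.96) -> tuple:
    """Wilson score interval in percent (z=1.96 gives roughly 95% coverage).

    Assumption: one binary outcome per task; this is not necessarily the
    method used by the official leaderboard.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = resolved / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    low = max(0.0, center - half)
    high = min(1.0, center + half)
    return (100.0 * low, 100.0 * high)


if __name__ == "__main__":
    # Made-up counts for demonstration, not benchmark results.
    resolved, total = 50, 200
    low, high = wilson_interval(resolved, total)
    print(f"{percent_resolved(resolved, total):.1f}% resolved "
          f"(95% CI {low:.1f} to {high:.1f}, n={total})")
```

Reporting the interval alongside n makes it clearer when two scores on small subsets are not meaningfully different.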