evals.report
BenchmarksLabsCompareRun guides

SWE-rebench

A continuously updated, contamination-free agentic software-engineering benchmark from Nebius that mines fresh post-cutoff GitHub issue/PR tasks and evaluates LLM agents under a fixed ReAct scaffold, reporting the monthly decontaminated resolved rate.

CodingResolved rate (pass@1)Higher is better

What this benchmark measures

A continuously updated, contamination-free agentic software-engineering benchmark from Nebius that mines fresh post-cutoff GitHub issue/PR tasks and evaluates LLM agents under a fixed ReAct scaffold, reporting the monthly decontaminated resolved rate.

Rows on this page are sourced from public benchmark artifacts, leaderboard exports, or source-linked model reports. Each row keeps benchmark version, source model name, and available run details attached to the score.

The metric shown here is Resolved rate (pass@1). It should be interpreted within SWE-rebench, not compared as part of a site-wide ranking.

No composite ranking
evals.report never combines benchmarks. Resolved rate (pass@1) on SWE-rebench is its own number — don’t average it with other metrics.