SWE-rebench
A continuously updated, contamination-free agentic software-engineering benchmark from Nebius that mines fresh post-cutoff GitHub issue/PR tasks and evaluates LLM agents under a fixed ReAct scaffold, reporting the monthly decontaminated resolved rate.
What this benchmark measures
Rows on this page are sourced from public benchmark artifacts, leaderboard exports, or source-linked model reports. Each row keeps benchmark version, source model name, and available run details attached to the score.
The metric shown here is resolved rate (pass@1). Interpret scores only within SWE-rebench; they are not meant to be compared across benchmarks as part of a site-wide ranking.
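As a rough illustration of how a decontaminated resolved rate could be computed, the sketch below keeps only tasks created after a model's cutoff date and reports the fraction resolved in a single attempt. The `TaskResult` fields, the `resolved_rate` helper, and the sample data are all hypothetical and are not taken from the SWE-rebench tooling; the actual pipeline may aggregate runs differently.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TaskResult:
    # Hypothetical record layout, not the benchmark's real schema.
    task_id: str
    created: date    # when the GitHub issue/PR task was created
    resolved: bool   # True if the agent's single attempt passed the task's tests


def resolved_rate(results: list[TaskResult], cutoff: date) -> float:
    """Fraction of tasks created strictly after `cutoff` that were resolved.

    With one attempt per task, this fraction is the pass@1 resolved rate
    on the post-cutoff (decontaminated) subset.
    """
    fresh = [r for r in results if r.created > cutoff]
    if not fresh:
        raise ValueError("no tasks were created after the cutoff date")
    return sum(r.resolved for r in fresh) / len(fresh)


# Made-up sample data for illustration only.
results = [
    TaskResult("a", date(2025, 1, 10), True),   # before cutoff, excluded
    TaskResult("b", date(2025, 2, 3), False),
    TaskResult("c", date(2025, 2, 20), True),
    TaskResult("d", date(2025, 3, 1), True),
]

rate = resolved_rate(results, cutoff=date(2025, 1, 31))
print(f"Resolved rate: {rate:.1%}")
```

Filtering by creation date before scoring is what keeps a task the model may have seen during training from inflating its score, which is why the same model can show different rates for different monthly windows.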