evals.report

SWE-rebench

A continuously updated, contamination-free agentic software-engineering benchmark from Nebius that mines fresh post-cutoff GitHub issue/PR tasks and evaluates LLM agents under a fixed ReAct scaffold, reporting the monthly decontaminated resolved rate.

Coding | Resolved rate (pass@1) | Higher is better
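As a rough illustration of the metric named above, the sketch below computes a resolved rate over one month of tasks, keeping only tasks created after a model's training cutoff. The exact decontamination rule and data schema used by SWE-rebench are not given here, so `TaskResult`, its fields, and `decontaminated_resolved_rate` are hypothetical names, and the "created after cutoff" filter is an assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class TaskResult:
    """One agent attempt on one task (hypothetical schema, not the benchmark's own)."""
    task_id: str
    created: date    # when the underlying issue/PR task was created
    resolved: bool   # True if the single attempt (pass@1) passed the task's tests


def decontaminated_resolved_rate(
    results: Iterable[TaskResult],
    model_cutoff: date,
    month: Tuple[int, int],
) -> Optional[float]:
    """Share of resolved tasks among those created in `month` (year, month)
    and strictly after `model_cutoff`. Returns None if no task qualifies."""
    eligible = [
        r for r in results
        if r.created > model_cutoff                      # assumed decontamination rule
        and (r.created.year, r.created.month) == month   # monthly window
    ]
    if not eligible:
        return None
    return sum(r.resolved for r in eligible) / len(eligible)


# Made-up example data, for illustration only.
results = [
    TaskResult("a", date(2025, 1, 5), True),
    TaskResult("b", date(2025, 1, 20), False),
    TaskResult("c", date(2025, 1, 25), True),
    TaskResult("d", date(2024, 12, 30), True),
]
rate = decontaminated_resolved_rate(results, date(2025, 1, 10), (2025, 1))
```

With a cutoff of 10 January 2025, task "a" is dropped as potentially contaminated and task "d" falls outside the month, so the rate is taken over "b" and "c" only.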

No run guide for this benchmark yet.