SWE-rebench
A continuously updated, contamination-free agentic software-engineering benchmark from Nebius. It mines fresh GitHub issue/PR tasks created after the model's training cutoff, evaluates LLM agents under a fixed ReAct scaffold, and reports the decontaminated resolved rate each month.
Category: Coding | Metric: Resolved rate (pass@1) | Higher is better
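To make the metric concrete, here is a minimal sketch of how a decontaminated resolved rate (pass@1) could be computed. It assumes one attempt per task and a simple date filter against the model's training cutoff. The `TaskResult` record, its fields, and the function name are hypothetical illustrations, not the benchmark's actual data format or tooling.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TaskResult:
    # Hypothetical record shape, for illustration only.
    task_id: str
    created: date    # date the underlying issue/PR task was created
    resolved: bool   # True if the agent's single attempt passed the task's tests


def decontaminated_resolved_rate(results, model_cutoff):
    """Fraction of resolved tasks among those created strictly after model_cutoff.

    Returns None when no task is newer than the cutoff, since the rate is
    undefined in that case.
    """
    fresh = [r for r in results if r.created > model_cutoff]
    if not fresh:
        return None
    return sum(r.resolved for r in fresh) / len(fresh)


# Made-up example data: one task predates the assumed cutoff and is excluded.
results = [
    TaskResult("task-a", date(2025, 1, 10), True),
    TaskResult("task-b", date(2025, 3, 5), True),
    TaskResult("task-c", date(2025, 3, 18), False),
    TaskResult("task-d", date(2025, 3, 27), False),
]
rate = decontaminated_resolved_rate(results, model_cutoff=date(2025, 2, 1))
print(f"resolved rate: {rate:.1%}")
```

Filtering by task creation date before averaging is what keeps tasks the model may have seen during training out of the score. A higher value means more tasks resolved on the first attempt.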
No run guide for this benchmark yet.