Question 1

What is SWE-rebench?

Accepted Answer

A continuously updated, contamination-free agentic software-engineering benchmark from Nebius that mines fresh post-cutoff GitHub issue/PR tasks and evaluates LLM agents under a fixed ReAct scaffold, reporting the monthly decontaminated resolved rate. It is a coding benchmark measured by Resolved rate (pass@1).

Question 2

What does Resolved rate (pass@1) mean on SWE-rebench?

Accepted Answer

SWE-rebench reports Resolved rate (pass@1) (%); higher is better. Scores are shown only within SWE-rebench and are never averaged with other benchmarks.

Question 3

What is the top reported SWE-rebench score?

Accepted Answer

Claude Opus 4.6 has the top reported score on SWE-rebench: 65.3% (Resolved rate (pass@1)).

Question 4

Why do SWE-rebench scores differ across runs?

Accepted Answer

Harness, scaffold, reasoning effort, and prompt setup change results, so two runs of the same model can differ. evals.report keeps each score with its run context so the differences stay visible.

Question 5

Does evals.report rank models across benchmarks?

Accepted Answer

No. SWE-rebench scores are shown within their own metric; evals.report never combines benchmarks into a composite ranking or a single "best model".