Question 1

What is SWE-rebench?

Accepted Answer

SWE-rebench is a continuously updated, contamination-free agentic software-engineering benchmark from Nebius. It mines fresh post-cutoff GitHub issue/PR tasks, evaluates LLM agents under a fixed ReAct scaffold, and reports a monthly decontaminated resolved rate. It is a coding benchmark scored by Resolved rate (pass@1).
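To make "post-cutoff" concrete, here is a minimal sketch of date-based filtering. It assumes "post-cutoff" means a task was created after a given cutoff date; the `Task` type, the `post_cutoff` helper, the task IDs, and the dates are all hypothetical and are not taken from the SWE-rebench pipeline.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Task:
    task_id: str   # hypothetical identifier, not a real SWE-rebench task
    created: date  # date the underlying issue/PR was created


def post_cutoff(tasks: list[Task], cutoff: date) -> list[Task]:
    """Keep only tasks created strictly after the cutoff date."""
    return [t for t in tasks if t.created > cutoff]


# Illustrative tasks with made-up dates.
tasks = [
    Task("example/repo-1", date(2025, 1, 10)),
    Task("example/repo-2", date(2025, 6, 3)),
]

fresh = post_cutoff(tasks, cutoff=date(2025, 3, 1))
print([t.task_id for t in fresh])  # ['example/repo-2']
```

A task dated on or before the cutoff is dropped, so a model whose training data ends at that date could not have seen the retained tasks during training.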

Question 2

What does Resolved rate (pass@1) mean on SWE-rebench?

Accepted Answer

SWE-rebench reports Resolved rate (pass@1) as a percentage: the share of tasks an agent resolves when given a single attempt per task. Higher is better. Scores are shown only within SWE-rebench and are never averaged with other benchmarks.
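As a rough illustration of how a resolved-rate percentage is computed, the sketch below assumes one boolean outcome per task, where `True` means the single attempt resolved the task. The `resolved_rate` function and the sample outcomes are hypothetical; the benchmark's own harness may aggregate differently.

```python
from typing import Sequence


def resolved_rate(resolved: Sequence[bool]) -> float:
    """Percentage of tasks resolved, with one attempt per task."""
    if not resolved:
        raise ValueError("need at least one task outcome")
    return 100.0 * sum(resolved) / len(resolved)


# Hypothetical outcomes for four tasks (not real benchmark data).
outcomes = [True, False, True, True]
print(f"{resolved_rate(outcomes):.1f}%")  # 75.0%
```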

Question 3

What is the top reported SWE-rebench score?

Accepted Answer

Claude Opus 4.6 has the top reported score on SWE-rebench: a 65.3% Resolved rate (pass@1).

Question 4

Why do SWE-rebench scores differ across runs?

Accepted Answer

Harness, scaffold, reasoning effort, and prompt setup change results, so two runs of the same model can differ. evals.report keeps each score with its run context so the differences stay visible.
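One way to keep a score attached to its run context is sketched below. This is an assumption about how such records could be modeled, not a description of how evals.report stores data; the `RunContext` and `ScoreRecord` types, the `comparable` helper, and every value shown are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunContext:
    harness: str
    scaffold: str
    reasoning_effort: str
    prompt_setup: str


@dataclass(frozen=True)
class ScoreRecord:
    model: str
    benchmark: str
    score_pct: float
    context: RunContext


def comparable(a: ScoreRecord, b: ScoreRecord) -> bool:
    """Two scores are directly comparable only if benchmark and run context match."""
    return a.benchmark == b.benchmark and a.context == b.context


# Placeholder values for illustration only.
base = RunContext("harness-x", "react", "high", "default")
low_effort = RunContext("harness-x", "react", "low", "default")

run_a = ScoreRecord("model-a", "SWE-rebench", 50.0, base)
run_b = ScoreRecord("model-a", "SWE-rebench", 46.0, low_effort)

print(comparable(run_a, run_a))  # True
print(comparable(run_a, run_b))  # False: reasoning effort differs
```

Because the context travels with the score, a gap between two runs of the same model can be traced to the setting that changed rather than read as noise.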

Question 5

Does evals.report rank models across benchmarks?

Accepted Answer

No. SWE-rebench scores are shown within their own metric; evals.report never combines benchmarks into a composite ranking or a single "best model".

Model	Lab	Score	Source model	Status	Date
Claude Opus 4.6	Anthropic	65.3%	—	Official	Feb 5, 2026
GLM-5 (Open)	Z.ai	62.8%	—	Official	Feb 11, 2026
GLM-5.1 (Open)	Z.ai	62.7%	—	Official	Apr 7, 2026
DeepSeek V3.2 (Open)	DeepSeek	60.9%	—	Unverified	Dec 1, 2025
Claude Sonnet 4.6	Anthropic	60.7%	—	Unverified	Feb 17, 2026
GLM-4.7 (Open)	Z.ai	58.7%	—	Unverified	Dec 22, 2025
Kimi K2.5 (Open)	Moonshot AI	58.5%	—	Unverified	Jan 27, 2026
GPT-5.3-Codex	OpenAI	58.2%	—	Unverified	Feb 5, 2026
Gemini 3 Flash	Google DeepMind	57.6%	—	Official	Dec 17, 2025
Gemini 3 Pro	Google DeepMind	56.5%	—	Official	Nov 18, 2025
MiniMax M2.7 (Open)	MiniMax	51.9%	—	Unverified	Mar 18, 2026