evals.report

SWE-rebench

A continuously updated, contamination-free agentic software-engineering benchmark from Nebius that mines fresh post-cutoff GitHub issue/PR tasks and evaluates LLM agents under a fixed ReAct scaffold, reporting the monthly decontaminated resolved rate.

Coding · Resolved rate (pass@1) · Higher is better
| Model | Lab | Score | Source model | Status | Date |
|---|---|---|---|---|---|
| Claude Opus 4.6 | Anthropic | 65.3% | | Official | Feb 5, 2026 |
| GLM-5 | Z.ai | 62.8% | | Official | Feb 11, 2026 |
| GLM-5.1 | Z.ai | 62.7% | | Official | Apr 7, 2026 |
| DeepSeek V3.2 | DeepSeek | 60.9% | | Unverified | Dec 1, 2025 |
| Claude Sonnet 4.6 | Anthropic | 60.7% | | Unverified | Feb 17, 2026 |
| GLM-4.7 | Z.ai | 58.7% | | Unverified | Dec 22, 2025 |
| Kimi K2.5 | Moonshot AI | 58.5% | | Unverified | Jan 27, 2026 |
| GPT-5.3-Codex | OpenAI | 58.2% | | Unverified | Feb 5, 2026 |
| Gemini 3 Flash | Google DeepMind | 57.6% | | Official | Dec 17, 2025 |
| Gemini 3 Pro | Google DeepMind | 56.5% | | Official | Nov 18, 2025 |
| MiniMax M2.7 | MiniMax | 51.9% | | Unverified | Mar 18, 2026 |

Each row reports the model’s resolved rate (pass@1) on SWE-rebench.
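
As a rough illustration of the metric, a resolved rate at pass@1 can be read as the share of tasks whose single attempt passed the task's tests. The sketch below assumes one attempt per task; the `TaskResult` type, its field names, and the sample data are hypothetical and are not taken from the SWE-rebench harness, which may aggregate over several runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Outcome of one agent attempt on one task (hypothetical structure)."""
    task_id: str    # illustrative identifier, not a real SWE-rebench task ID
    resolved: bool  # True if the single attempt passed the task's tests


def resolved_rate(results: list[TaskResult]) -> float:
    """Return the percentage of tasks resolved, assuming one attempt per task."""
    if not results:
        raise ValueError("no results to score")
    return 100.0 * sum(r.resolved for r in results) / len(results)


# Made-up sample: three of four tasks resolved.
sample = [
    TaskResult("task-1", True),
    TaskResult("task-2", False),
    TaskResult("task-3", True),
    TaskResult("task-4", True),
]
print(f"{resolved_rate(sample):.1f}%")  # 75.0%
```

Under this reading, a score of 65.3% would mean that roughly 65 of every 100 evaluated tasks were resolved on the first attempt.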