SWE-rebench
A continuously updated, contamination-free agentic software-engineering benchmark from Nebius that mines fresh post-cutoff GitHub issue/PR tasks and evaluates LLM agents under a fixed ReAct scaffold, reporting the monthly decontaminated resolved rate.
Coding · Resolved rate (pass@1) · Higher is better
| Model | Lab | Score | Source model | Status | Date |
|---|---|---|---|---|---|
| Claude Opus 4.6 | Anthropic | 65.3% | — | Official | Feb 5, 2026 |
| GLM-5 | Z.ai | 62.8% | — | Official | Feb 11, 2026 |
| GLM-5.1 | Z.ai | 62.7% | — | Official | Apr 7, 2026 |
| DeepSeek V3.2 | DeepSeek | 60.9% | — | Unverified | Dec 1, 2025 |
| Claude Sonnet 4.6 | Anthropic | 60.7% | — | Unverified | Feb 17, 2026 |
| GLM-4.7 | Z.ai | 58.7% | — | Unverified | Dec 22, 2025 |
| Kimi K2.5 | Moonshot AI | 58.5% | — | Unverified | Jan 27, 2026 |
| GPT-5.3-Codex | OpenAI | 58.2% | — | Unverified | Feb 5, 2026 |
| Gemini 3 Flash | Google DeepMind | 57.6% | — | Official | Dec 17, 2025 |
| Gemini 3 Pro | Google DeepMind | 56.5% | — | Official | Nov 18, 2025 |
| MiniMax M2.7 | MiniMax | 51.9% | — | Unverified | Mar 18, 2026 |
Each row reports the model’s Resolved rate (pass@1) on SWE-rebench.
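As a rough illustration of how a decontaminated resolved rate (pass@1) could be computed, the sketch below keeps only tasks created after a model's cutoff date and reports the share resolved on a single attempt. The `TaskResult` record, its field names, the `resolved_rate` helper, and the sample data are all hypothetical; the exact filtering and aggregation SWE-rebench uses (for example, whether several runs are averaged) is not stated on this page.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TaskResult:
    """Outcome of one agent attempt on one task (hypothetical record layout)."""
    task_id: str
    created: date    # when the underlying issue/PR was created
    resolved: bool   # True if the agent's patch passed the task's tests


def resolved_rate(results: list[TaskResult], model_cutoff: date) -> float:
    """Percentage of post-cutoff tasks resolved on the single attempt."""
    # Decontamination step (assumed): drop tasks the model could have seen.
    fresh = [r for r in results if r.created > model_cutoff]
    if not fresh:
        raise ValueError("no tasks created after the model cutoff")
    return 100.0 * sum(r.resolved for r in fresh) / len(fresh)


# Made-up sample data, not taken from the leaderboard.
results = [
    TaskResult("task-a", date(2026, 1, 10), True),
    TaskResult("task-b", date(2026, 1, 15), False),
    TaskResult("task-c", date(2026, 1, 20), True),
    TaskResult("task-d", date(2025, 6, 1), True),  # predates cutoff, excluded
]
rate = resolved_rate(results, model_cutoff=date(2025, 12, 1))
print(f"{rate:.1f}%")  # prints "66.7%"
```

Because the pre-cutoff task is excluded before counting, two of the three remaining tasks are resolved, which gives 66.7% rather than the 75.0% an unfiltered count would report.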