SWE-bench Multilingual
A software-engineering benchmark of 300 curated GitHub issue-resolution tasks spanning 42 repositories and 9 programming languages (C, C++, Go, Java, JavaScript, TypeScript, PHP, Ruby, Rust). It measures the percentage of real-world issues a model can resolve, where an issue counts as resolved when its fail-to-pass and pass-to-pass tests succeed.
Coding · % resolved · Higher is better
| Model | Lab | Score (% resolved) | Source model | Status | Date |
|---|---|---|---|---|---|
| Gemini 3 Flash | Google DeepMind | 72.7% | — | Official | Dec 17, 2025 |
| Claude Opus 4.6 | Anthropic | 72.0% | — | Official | Feb 5, 2026 |
| Claude Opus 4.5 | Anthropic | 70.7% | — | Official | Nov 24, 2025 |
| GLM-5 | Z.ai | 69.7% | — | Official | Feb 11, 2026 |
| Gemini 3 Pro | Google DeepMind | 68.7% | — | Official | Nov 18, 2025 |
| MiniMax M2.5 | MiniMax | 68.3% | — | Official | Feb 12, 2026 |
| Kimi K2.5 | Moonshot AI | 67.3% | — | Official | Jan 27, 2026 |
| Claude Sonnet 4.5 | Anthropic | 67.0% | — | Official | Sep 29, 2025 |
| GPT-5.2 | OpenAI | 66.7% | — | Official | Dec 11, 2025 |
| GPT-5.2-Codex | OpenAI | 66.3% | — | Official | Dec 18, 2025 |
| Claude Haiku 4.5 | Anthropic | 64.7% | — | Official | Oct 15, 2025 |
| DeepSeek V3.2 | DeepSeek | 59.0% | — | Official | Dec 1, 2025 |
| Claude 3.7 Sonnet | Anthropic | 43% | — | Verified | Feb 24, 2025 |
| GPT-5 mini | OpenAI | 39.7% | — | Official | Aug 7, 2025 |
Each row reports the model’s % resolved on SWE-bench Multilingual.
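The % resolved metric described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual harness: the `TaskResult` class and the `is_resolved` and `percent_resolved` functions are hypothetical names, and the rule that a task needs at least one fail-to-pass test is an assumption made here to avoid counting empty results as resolved.

```python
from dataclasses import dataclass, field


@dataclass
class TaskResult:
    """Hypothetical per-task record; field names are illustrative only."""
    # Outcomes of tests that fail before the fix and should pass after it.
    fail_to_pass: list = field(default_factory=list)
    # Outcomes of tests that passed before the fix and must keep passing.
    pass_to_pass: list = field(default_factory=list)


def is_resolved(result: TaskResult) -> bool:
    # Assumption: a task needs at least one fail-to-pass test to count.
    return (
        bool(result.fail_to_pass)
        and all(result.fail_to_pass)
        and all(result.pass_to_pass)
    )


def percent_resolved(results: list) -> float:
    # Share of tasks whose tests all succeed, as a percentage.
    if not results:
        return 0.0
    resolved = sum(1 for r in results if is_resolved(r))
    return 100.0 * resolved / len(results)
```

As a purely arithmetic example, if all 300 tasks are scored and 218 of them are resolved, `percent_resolved` returns about 72.67, which rounds to 72.7%. A single regression in a pass-to-pass test is enough to mark a task unresolved, even when every fail-to-pass test succeeds.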