SWE-bench Pro
A harder public software-engineering agent benchmark built around professional repository tasks.
Category: Coding. Metric: % resolved (higher is better).
| Model | Lab | Score (% resolved) | Source model | Status | Date |
|---|---|---|---|---|---|
| GPT-5.4 xHigh | OpenAI | 59.10% | gpt-5.4 (xHigh)* | Official | May 30, 2026 |
| Muse Spark | Meta | 55.00% | Muse Spark* | Official | May 30, 2026 |
| Claude Opus 4.6 thinking | Anthropic | 51.90% | claude-opus-4-6 (thinking)* | Official | May 30, 2026 |
| Gemini 3.1 Pro Preview | Google DeepMind | 46.10% | gemini-3.1-pro (thinking)* | Official | May 30, 2026 |
| Claude Opus 4.5 | Anthropic | 45.89% | claude-opus-4-5-20251101 | Official | May 30, 2026 |
| Claude Sonnet 4.5 | Anthropic | 43.60% | claude-4-5-Sonnet | Official | May 30, 2026 |
| Gemini 3 Pro | Google DeepMind | 43.30% | gemini-3-pro-preview | Official | May 30, 2026 |
| Claude Sonnet 4 | Anthropic | 42.70% | claude-4-Sonnet | Official | May 30, 2026 |
| GPT-5 high | OpenAI | 41.78% | gpt-5-2025-08-07 (High) | Official | May 30, 2026 |
| GPT-5.2-Codex | OpenAI | 41.04% | gpt-5.2-codex | Official | May 30, 2026 |
| Claude Haiku 4.5 | Anthropic | 39.45% | claude-4-5-haiku | Official | May 30, 2026 |
| Qwen 3 Coder 480B | Alibaba / Qwen | 38.70% | qwen3-coder-480b-a35b | Official | May 30, 2026 |
| MiniMax M2.1 | MiniMax | 36.81% | minimax-2.1 | Official | May 30, 2026 |
| Gemini 3 Flash | Google DeepMind | 34.63% | gemini-3-flash | Official | May 30, 2026 |
| GPT-5.2 | OpenAI | 29.94% | gpt-5.2 | Official | May 30, 2026 |
| Kimi K2 Instruct | Moonshot AI | 27.67% | kimi-k2-instruct | Official | May 30, 2026 |
| Qwen3 235B A22B Instruct 2507 | Alibaba / Qwen | 21.41% | qwen3-235b-a22b | Official | May 30, 2026 |
| GPT-OSS-120B | OpenAI | 16.20% | gpt-oss-120b | Official | May 30, 2026 |
| DeepSeek V3.2 | DeepSeek | 15.56% | deepseek-v3p2 | Official | May 30, 2026 |
| Llama 3.1 405B | Meta | 11.18% | llama3-1-405b-instruct | Official | May 30, 2026 |
| GLM-4.6 | Z.ai | 9.67% | glm-4.6 | Official | May 30, 2026 |
| Llama 4 Maverick | Meta | 5.24% | llama4-maverick-17b-instruct | Official | May 30, 2026 |
Each row reports the model’s % resolved on SWE-bench Pro.