BenchmarksCoding
DeepSWE
A long-horizon software-engineering benchmark with original tasks, broad repository coverage, and behavioral verifiers.
Coding% resolvedHigher is better
| Model | Lab | Score↓ | Source model | Status | Date | |
|---|---|---|---|---|---|---|
| GPT-5.5 | OpenAI | 70.05% | gpt-5-5 | Official | — | Details |
| GPT-5.4 | OpenAI | 55.53% | gpt-5-4 | Official | — | Details |
| Claude Opus 4.7 | Anthropic | 54.20% | claude-opus-4-7 | Official | — | Details |
| Claude Sonnet 4.6 | Anthropic | 31.56% | claude-sonnet-4-6 | Official | — | Details |
| Gemini 3.5 Flash | Google DeepMind | 28.32% | gemini-3-5-flash | Official | — | Details |
| Claude Opus 4.6 | Anthropic | 27.06% | claude-opus-4-6 | Official | — | Details |
| Kimi K2.6 | Moonshot AI | 23.89% | kimi-k2-6 | Official | — | Details |
| GLM-5.1 | Z.ai | 17.48% | glm-5-1 | Official | — | Details |
| Gemini 3.1 Pro Preview | Google DeepMind | 9.88% | gemini-3-1-pro-preview | Official | — | Details |
| DeepSeek V4 Pro | DeepSeek | 7.52% | deepseek-v4-pro | Official | — | Details |
| Gemini 3 Flash | Google DeepMind | 5.16% | gemini-3-flash-preview | Official | — | Details |
| Claude Haiku 4.5 | Anthropic | 0.22% | claude-haiku-4-5 | Official | — | Details |
Each row reports the model’s % resolved on DeepSWE. Click a row for the full run context.