SWE-bench Multimodal
A software-engineering benchmark of 517 real GitHub issues drawn from JavaScript and web projects with a visual component. Each issue includes visual context (screenshots, UI mockups, or diagrams), and the benchmark measures whether an AI system can resolve it, with each fix verified by the repository's tests.
Coding · % resolved · Higher is better
| Model | Lab | Score (% resolved) | Source model | Status | Date |
|---|---|---|---|---|---|
| Claude Mythos Preview | Anthropic | 59.0% | — | Verified | Apr 7, 2026 |
| Claude Opus 4.8 | Anthropic | 38.4% | — | Verified | May 28, 2026 |
| o3 | OpenAI | 35.98% | — | Verified | Apr 16, 2025 |
| Claude Sonnet 4 | Anthropic | 35.59% | — | Verified | May 22, 2025 |
| o4-mini | OpenAI | 33.85% | — | Verified | Apr 16, 2025 |
| Claude 3.7 Sonnet | Anthropic | 31.33% | — | Verified | Feb 24, 2025 |
| GPT-4.1 | OpenAI | 31.14% | — | Verified | Apr 14, 2025 |
| GPT-4o | OpenAI | 30.37% | — | Verified | May 13, 2024 |
| Claude 3.5 Sonnet | Anthropic | 25.34% | — | Verified | Jun 20, 2024 |
Each row reports the model’s % resolved on SWE-bench Multimodal.
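As a rough illustration of how a % resolved figure could be derived, the sketch below counts an issue as resolved when the repository's tests pass after the model's patch is applied, then divides by the number of issues attempted. The `InstanceResult` type, its field names, and the sample data are hypothetical and are not taken from the benchmark's actual harness or results.

```python
from dataclasses import dataclass


@dataclass
class InstanceResult:
    """Outcome for one benchmark issue (hypothetical structure)."""
    instance_id: str     # made-up identifier for illustration
    tests_passed: bool   # True if the repository's tests pass with the patch applied


def percent_resolved(results: list[InstanceResult]) -> float:
    """Return the share of issues resolved, as a percentage of all issues attempted."""
    if not results:
        raise ValueError("at least one result is required")
    resolved = sum(1 for r in results if r.tests_passed)
    return 100.0 * resolved / len(results)


# Made-up sample: three of four issues resolved.
sample = [
    InstanceResult("example-1", True),
    InstanceResult("example-2", True),
    InstanceResult("example-3", False),
    InstanceResult("example-4", True),
]

print(f"{percent_resolved(sample):.2f}%")
```

Under this reading, a score in the table is the resolved count divided by the benchmark's 517 issues, which is why higher is better.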