evals.report

SWE-bench Multimodal

A software-engineering benchmark of 517 real GitHub issues from JavaScript/web projects with a visual component. Each issue includes visual context (screenshots, UI mockups, diagrams). The benchmark measures whether AI systems can resolve these issues, with each fix verified by the repository's tests.

Coding · % resolved · Higher is better
Model                  Lab        Score   Source model  Status    Date
Claude Mythos Preview  Anthropic  59.0%   -             Verified  Apr 7, 2026
Claude Opus 4.8        Anthropic  38.4%   -             Verified  May 28, 2026
o3                     OpenAI     35.98%  -             Verified  Apr 16, 2025
Claude Sonnet 4        Anthropic  35.59%  -             Verified  May 22, 2025
o4-mini                OpenAI     33.85%  -             Verified  Apr 16, 2025
Claude 3.7 Sonnet      Anthropic  31.33%  -             Verified  Feb 24, 2025
GPT-4.1                OpenAI     31.14%  -             Verified  Apr 14, 2025
GPT-4o                 OpenAI     30.37%  -             Verified  May 13, 2024
Claude 3.5 Sonnet      Anthropic  25.34%  -             Verified  Jun 20, 2024

Each row reports the model’s % resolved on SWE-bench Multimodal.
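
As a rough illustration of how a "% resolved" figure could be derived, the sketch below divides the number of resolved issues by the benchmark size (517, per the description above). The function name `percent_resolved`, the per-issue outcome mapping, and the choice to count missing issues as unresolved are assumptions made for this example; they are not taken from the benchmark's official harness.

```python
from typing import Mapping

# Benchmark size as stated in the description above.
TOTAL_ISSUES = 517


def percent_resolved(outcomes: Mapping[str, bool],
                     total_issues: int = TOTAL_ISSUES) -> float:
    """Return the percentage of issues resolved, rounded to 2 decimals.

    `outcomes` maps a (hypothetical) issue ID to True when the
    repository's tests passed after the model's fix was applied.
    Issues absent from `outcomes` are counted as unresolved, which
    is an assumption of this sketch.
    """
    if total_issues <= 0:
        raise ValueError("total_issues must be positive")
    resolved = sum(1 for passed in outcomes.values() if passed)
    if resolved > total_issues:
        raise ValueError("more resolved issues than total issues")
    return round(100.0 * resolved / total_issues, 2)


# Hypothetical run: three issues attempted, two resolved.
example = {"issue-a": True, "issue-b": False, "issue-c": True}
print(percent_resolved(example, total_issues=3))  # 66.67
```

Dividing by the full benchmark size rather than by the number of attempted issues keeps scores comparable across runs that skip or fail to complete some issues.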