evals.report

SWE-bench Verified

A curated SWE-bench split for evaluating systems that resolve real software engineering issues.

Category: Coding. Metric: % resolved. Higher is better.
| Model | Lab | Score | Source model | Status | Date |
|---|---|---|---|---|---|
| Claude Opus 4.8 | Anthropic | 88.6% | Claude Opus 4.8 | Verified | May 28, 2026 |
| Claude Opus 4.7 | Anthropic | 83.5% | Claude Opus 4.7 | Official | May 30, 2026 |
| GPT-5.5 | OpenAI | 80.6% | GPT-5.5 | Official | May 30, 2026 |
| Claude Opus 4.6 | Anthropic | 78.7% | Claude Opus 4.6 | Official | May 30, 2026 |
| GPT-5.4 | OpenAI | 76.9% | GPT-5.4 | Official | May 30, 2026 |
| Kimi K2.6 | Moonshot AI | 76.7% | Kimi K2.6 | Official | May 30, 2026 |
| Claude Opus 4.5 | Anthropic | 76.7% | Claude Opus 4.5 | Official | May 30, 2026 |
| Gemini 3.1 Pro Preview | Google DeepMind | 75.6% | Gemini 3.1 Pro | Official | May 30, 2026 |
| Gemini 3 Flash | Google DeepMind | 75.4% | Gemini 3 Flash | Official | May 30, 2026 |
| Claude Sonnet 4.6 | Anthropic | 75.2% | Claude Sonnet 4.6 | Official | May 30, 2026 |
| GPT-5.3-Codex | OpenAI | 74.8% | GPT-5.3 Codex | Official | May 30, 2026 |
| GLM-5.1 | Z.ai | 74.2% | GLM-5.1 | Official | May 30, 2026 |
| Kimi K2.5 | Moonshot AI | 73.8% | Kimi K2.5 | Official | May 30, 2026 |
| GPT-5.2 | OpenAI | 73.8% | GPT-5.2 | Official | May 30, 2026 |
| GPT-5 high | OpenAI | 73.6% | GPT-5 | Official | May 30, 2026 |
| Claude Opus 4.1 | Anthropic | 73.3% | Claude Opus 4.1 | Official | May 30, 2026 |
| Gemini 3 Pro | Google DeepMind | 72.9% | Gemini 3 Pro | Official | May 30, 2026 |
| GLM-5 | Z.ai | 72.1% | GLM-5 | Official | May 30, 2026 |
| Claude Sonnet 4.5 | Anthropic | 71.3% | Claude Sonnet 4.5 | Official | May 30, 2026 |
| Claude Opus 4 | Anthropic | 70.7% | Claude Opus 4 | Official | May 30, 2026 |
| GPT-5.1 | OpenAI | 68.0% | GPT-5.1 | Official | May 30, 2026 |
| GPT-5 mini | OpenAI | 64.7% | GPT-5 mini | Official | May 30, 2026 |
| o3 | OpenAI | 62.3% | o3 | Official | May 30, 2026 |
| Claude 3.7 Sonnet | Anthropic | 61.0% | Claude 3.7 Sonnet | Official | May 30, 2026 |

Each row reports the model’s % resolved on SWE-bench Verified.
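
As a rough illustration of the metric, % resolved can be read as the share of evaluated issues a system resolved, expressed as a percentage. The sketch below assumes each issue has a simple pass/fail outcome and that the score is a plain ratio rounded to one decimal; the page does not state how scores are aggregated, and the function name `percent_resolved` and the sample outcomes are hypothetical.

```python
from typing import Sequence


def percent_resolved(outcomes: Sequence[bool]) -> float:
    """Return the share of resolved instances as a percentage, rounded to one decimal.

    Assumption: one boolean per evaluated issue (True = resolved).
    """
    if not outcomes:
        raise ValueError("need at least one evaluated instance")
    return round(100.0 * sum(outcomes) / len(outcomes), 1)


# Illustrative outcomes only, not data from any run listed on this page.
example = [True, True, True, False]
print(f"{percent_resolved(example):.1f}%")  # prints "75.0%"
```

Under this reading, a higher value means more issues resolved, which matches the "higher is better" note on the leaderboard.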