evals.report

SWE-bench Pro

A harder public software-engineering agent benchmark built around professional repository tasks.

Category: Coding. Metric: % resolved. Higher is better.
| Model | Lab | Score | Source model | Status | Date |
| --- | --- | --- | --- | --- | --- |
| GPT-5.4 xHigh | OpenAI | 59.10% | gpt-5.4 (xHigh)* | Official | May 30, 2026 |
| Muse Spark | Meta | 55.00% | Muse Spark* | Official | May 30, 2026 |
| Claude Opus 4.6 thinking | Anthropic | 51.90% | claude-opus-4-6 (thinking)* | Official | May 30, 2026 |
| Gemini 3.1 Pro Preview | Google DeepMind | 46.10% | gemini-3.1-pro (thinking)* | Official | May 30, 2026 |
| Claude Opus 4.5 | Anthropic | 45.89% | claude-opus-4-5-20251101 | Official | May 30, 2026 |
| Claude Sonnet 4.5 | Anthropic | 43.60% | claude-4-5-Sonnet | Official | May 30, 2026 |
| Gemini 3 Pro | Google DeepMind | 43.30% | gemini-3-pro-preview | Official | May 30, 2026 |
| Claude Sonnet 4 | Anthropic | 42.70% | claude-4-Sonnet | Official | May 30, 2026 |
| GPT-5 high | OpenAI | 41.78% | gpt-5-2025-08-07 (High) | Official | May 30, 2026 |
| GPT-5.2-Codex | OpenAI | 41.04% | gpt-5.2-codex | Official | May 30, 2026 |
| Claude Haiku 4.5 | Anthropic | 39.45% | claude-4-5-haiku | Official | May 30, 2026 |
| Qwen 3 Coder 480B | Alibaba / Qwen | 38.70% | qwen3-coder-480b-a35b | Official | May 30, 2026 |
| MiniMax M2.1 | MiniMax | 36.81% | minimax-2.1 | Official | May 30, 2026 |
| Gemini 3 Flash | Google DeepMind | 34.63% | gemini-3-flash | Official | May 30, 2026 |
| GPT-5.2 | OpenAI | 29.94% | gpt-5.2 | Official | May 30, 2026 |
| Kimi K2 Instruct | Moonshot AI | 27.67% | kimi-k2-instruct | Official | May 30, 2026 |
| Qwen3 235B A22B Instruct 2507 | Alibaba / Qwen | 21.41% | qwen3-235b-a22b | Official | May 30, 2026 |
| GPT-OSS-120B | OpenAI | 16.20% | gpt-oss-120b | Official | May 30, 2026 |
| DeepSeek V3.2 | DeepSeek | 15.56% | deepseek-v3p2 | Official | May 30, 2026 |
| Llama 3.1 405B | Meta | 11.18% | llama3-1-405b-instruct | Official | May 30, 2026 |
| GLM-4.6 | Z.ai | 9.67% | glm-4.6 | Official | May 30, 2026 |
| Llama 4 Maverick | Meta | 5.24% | llama4-maverick-17b-instruct | Official | May 30, 2026 |

Each row reports the model’s % resolved on SWE-bench Pro.
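As a rough illustration of the metric, a "% resolved" score is the share of attempted tasks that a model resolved, expressed as a percentage. The sketch below assumes a simple resolved-over-total calculation rounded to two decimals. The function name `percent_resolved` and the counts are hypothetical and do not come from any run listed here.

```python
def percent_resolved(resolved: int, total: int) -> float:
    """Return the share of attempted tasks that were resolved, as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= resolved <= total:
        raise ValueError("resolved must be between 0 and total")
    # Round to two decimals to match the precision shown in the table.
    return round(100.0 * resolved / total, 2)


# Illustrative counts only; these are not taken from any run on this page.
print(f"{percent_resolved(50, 200):.2f}%")
```

Under this assumption, two scores are directly comparable only when they are computed over the same task set.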