evals.report
BenchmarksSourcesLabsCompareRun guides

DeepSWE

A long-horizon software-engineering benchmark with original tasks, broad repository coverage, and behavioral verifiers.

Coding% resolvedHigher is better
ModelLabScoreSource modelStatusDate
GPT-5.5OpenAI70.05%gpt-5-5OfficialDetails
GPT-5.4OpenAI55.53%gpt-5-4OfficialDetails
Claude Opus 4.7Anthropic54.20%claude-opus-4-7OfficialDetails
Claude Sonnet 4.6Anthropic31.56%claude-sonnet-4-6OfficialDetails
Gemini 3.5 FlashGoogle DeepMind28.32%gemini-3-5-flashOfficialDetails
Claude Opus 4.6Anthropic27.06%claude-opus-4-6OfficialDetails
Kimi K2.6Moonshot AI23.89%kimi-k2-6OfficialDetails
GLM-5.1Z.ai17.48%glm-5-1OfficialDetails
Gemini 3.1 Pro PreviewGoogle DeepMind9.88%gemini-3-1-pro-previewOfficialDetails
DeepSeek V4 ProDeepSeek7.52%deepseek-v4-proOfficialDetails
Gemini 3 FlashGoogle DeepMind5.16%gemini-3-flash-previewOfficialDetails
Claude Haiku 4.5Anthropic0.22%claude-haiku-4-5OfficialDetails

Each row reports the model’s % resolved on DeepSWE. Click a row for the full run context.