evals.report

SWE-bench Multilingual

A software-engineering benchmark of 300 curated GitHub issue-resolution tasks spanning 42 repositories and 9 programming languages (C, C++, Go, Java, JavaScript, TypeScript, PHP, Ruby, Rust). The score is the percentage of real-world issues a model resolves. An issue counts as resolved only when its fail-to-pass tests (failing before the fix) pass and its pass-to-pass tests (passing before the fix) still pass.
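The resolution criterion can be sketched in a few lines of Python. This is a minimal illustration and not the benchmark's actual harness: the `TaskResult` structure and the function names are hypothetical, and each test outcome is simplified to a boolean.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Test outcomes for one task after applying the model's patch.

    Hypothetical structure for illustration; not the benchmark's own schema.
    """
    fail_to_pass: list[bool]  # tests that failed before the fix; True = now passing
    pass_to_pass: list[bool]  # tests that passed before the fix; True = still passing


def is_resolved(result: TaskResult) -> bool:
    # A task counts as resolved only if every test in both groups passes.
    return all(result.fail_to_pass) and all(result.pass_to_pass)


def percent_resolved(results: list[TaskResult]) -> float:
    # Share of resolved tasks, expressed as a percentage.
    if not results:
        raise ValueError("need at least one task result")
    resolved = sum(is_resolved(r) for r in results)
    return 100.0 * resolved / len(results)


# Four made-up tasks: two resolved, one with an incomplete fix, one with a regression.
example = [
    TaskResult(fail_to_pass=[True], pass_to_pass=[True, True]),
    TaskResult(fail_to_pass=[True, False], pass_to_pass=[True]),
    TaskResult(fail_to_pass=[True], pass_to_pass=[False]),
    TaskResult(fail_to_pass=[True], pass_to_pass=[True]),
]
print(percent_resolved(example))  # 50.0
```

Assuming a score is computed over all 300 tasks, each task is worth one third of a percentage point. For example, 218 resolved tasks would give `round(100 * 218 / 300, 1)`, which is 72.7.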

Coding · % resolved · Higher is better
| Model | Lab | Score | Status | Date |
| --- | --- | --- | --- | --- |
| Gemini 3 Flash | Google DeepMind | 72.7% | Official | Dec 17, 2025 |
| Claude Opus 4.6 | Anthropic | 72.0% | Official | Feb 5, 2026 |
| Claude Opus 4.5 | Anthropic | 70.7% | Official | Nov 24, 2025 |
| GLM-5 | Z.ai | 69.7% | Official | Feb 11, 2026 |
| Gemini 3 Pro | Google DeepMind | 68.7% | Official | Nov 18, 2025 |
| MiniMax M2.5 | MiniMax | 68.3% | Official | Feb 12, 2026 |
| Kimi K2.5 | Moonshot AI | 67.3% | Official | Jan 27, 2026 |
| Claude Sonnet 4.5 | Anthropic | 67.0% | Official | Sep 29, 2025 |
| GPT-5.2 | OpenAI | 66.7% | Official | Dec 11, 2025 |
| GPT-5.2-Codex | OpenAI | 66.3% | Official | Dec 18, 2025 |
| Claude Haiku 4.5 | Anthropic | 64.7% | Official | Oct 15, 2025 |
| DeepSeek V3.2 | DeepSeek | 59.0% | Official | Dec 1, 2025 |
| Claude 3.7 Sonnet | Anthropic | 43% | Verified | Feb 24, 2025 |
| GPT-5 mini | OpenAI | 39.7% | Official | Aug 7, 2025 |

Each row reports the model’s % resolved on SWE-bench Multilingual.