evals.report

SWE-bench Multilingual

A software-engineering benchmark of 300 curated GitHub issue-resolution tasks spanning 42 repositories and 9 programming languages (C, C++, Go, Java, JavaScript, TypeScript, PHP, Ruby, Rust), measuring the percentage of real-world issues a model can resolve so that fail-to-pass and pass-to-pass tests succeed.

Coding | % resolved | Higher is better

What this benchmark measures

A task counts as resolved only when the model's patch makes the task's fail-to-pass tests succeed (tests that failed before the fix) while its pass-to-pass tests keep passing (tests that already passed before the fix). The score is the percentage of the 300 tasks resolved this way.

Rows on this page are sourced from public benchmark artifacts, leaderboard exports, or source-linked model reports. Each row keeps benchmark version, source model name, and available run details attached to the score.

The metric shown here is % resolved. It should be interpreted within SWE-bench Multilingual, not compared as part of a site-wide ranking.
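
As a rough illustration of how a % resolved figure can be derived from per-task test outcomes, here is a minimal Python sketch. It is not the official SWE-bench harness: the `TaskResult` record, its field names, and the rule that a task with no recorded fail-to-pass tests counts as unresolved are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class TaskResult:
    """Hypothetical per-task record; the field names are illustrative only."""
    task_id: str
    # Test name -> whether the test passed after the model's patch was applied.
    fail_to_pass: dict[str, bool] = field(default_factory=dict)
    pass_to_pass: dict[str, bool] = field(default_factory=dict)


def is_resolved(result: TaskResult) -> bool:
    # Assumption: a task with no recorded fail-to-pass tests is unresolved.
    if not result.fail_to_pass:
        return False
    # Every previously failing test must now pass, and no previously
    # passing test may have regressed.
    return all(result.fail_to_pass.values()) and all(result.pass_to_pass.values())


def percent_resolved(results: list[TaskResult]) -> float:
    if not results:
        raise ValueError("no task results to score")
    resolved = sum(is_resolved(r) for r in results)
    return 100.0 * resolved / len(results)
```

In this sketch the denominator is the number of records passed in, so a score over the full benchmark would need one record per task, with unattempted tasks included as unresolved.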

No composite ranking
evals.report never combines benchmarks. % resolved on SWE-bench Multilingual is its own number, so don’t average it with other metrics.