evals.report

SWE-bench Multilingual

A software-engineering benchmark of 300 curated GitHub issue-resolution tasks spanning 42 repositories and 9 programming languages (C, C++, Go, Java, JavaScript, TypeScript, PHP, Ruby, Rust). It measures the percentage of real-world issues a model resolves, where a task counts as resolved only if its fail-to-pass and pass-to-pass tests all succeed after the model's patch is applied.

Coding | % resolved | Higher is better
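
As a rough illustration of how the % resolved metric can be computed from per-task test outcomes, here is a minimal Python sketch. It is not the official evaluation harness: the `TaskResult` structure, its field names, and the task IDs are hypothetical, and treating a task with no fail-to-pass tests as unresolved is an assumption made here for safety.

```python
from dataclasses import dataclass, field


@dataclass
class TaskResult:
    """Test outcomes for one task after the model's patch is applied.

    Hypothetical structure for illustration; maps test name -> passed?
    """
    task_id: str
    fail_to_pass: dict[str, bool] = field(default_factory=dict)
    pass_to_pass: dict[str, bool] = field(default_factory=dict)


def is_resolved(result: TaskResult) -> bool:
    """A task is resolved only if every fail-to-pass test now passes
    and every pass-to-pass test still passes (no regressions)."""
    if not result.fail_to_pass:
        # Assumption: without any fail-to-pass test there is no evidence
        # that the issue was fixed, so the task is counted as unresolved.
        return False
    return all(result.fail_to_pass.values()) and all(result.pass_to_pass.values())


def percent_resolved(results: list[TaskResult]) -> float:
    """Percentage of tasks resolved; higher is better."""
    if not results:
        return 0.0
    return 100.0 * sum(is_resolved(r) for r in results) / len(results)


# Made-up outcomes for four hypothetical tasks.
results = [
    TaskResult("task-a", {"test_fix": True}, {"test_existing": True}),   # resolved
    TaskResult("task-b", {"test_fix": False}, {"test_existing": True}),  # issue not fixed
    TaskResult("task-c", {"test_fix": True}, {"test_existing": False}),  # regression
    TaskResult("task-d", {"test_fix": True}),                            # resolved
]

print(percent_resolved(results))  # 50.0
```

Only the first and last tasks count: the second leaves the target test failing, and the third fixes the issue but breaks a previously passing test.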

No run guide for this benchmark yet.