evals.report

Aider Polyglot

A coding benchmark that measures how reliably an LLM can solve problems and apply diff-based code edits across 225 challenging Exercism exercises spanning C++, Go, Java, JavaScript, Python, and Rust, with up to two attempts per problem.

Coding · % correct · Higher is better

What this benchmark measures

Aider Polyglot scores an LLM on 225 challenging Exercism exercises in C++, Go, Java, JavaScript, Python, and Rust. For each exercise the model must solve the problem and express its solution as diff-based edits to the code, with up to two attempts per problem.

Rows on this page are sourced from public benchmark artifacts, leaderboard exports, or source-linked model reports. Each row keeps benchmark version, source model name, and available run details attached to the score.

The metric shown here is % correct. It should be interpreted within Aider Polyglot, not compared as part of a site-wide ranking.
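As a rough illustration of how a % correct figure with up to two attempts could be computed, here is a minimal sketch. The function name, the input layout (one entry per exercise, giving the attempt on which it passed, or `None` if it never passed), and the sample outcomes are all hypothetical and made up for illustration. The sketch also assumes an exercise counts as correct if it passes on either attempt; the benchmark's own harness may score runs differently.

```python
from typing import Optional, Sequence


def percent_correct(
    passed_on_attempt: Sequence[Optional[int]], max_attempts: int = 2
) -> float:
    """Return the share of exercises solved within max_attempts, as a percentage.

    Each entry is the 1-based attempt on which the exercise passed,
    or None if it never passed. This layout is an assumption made for
    this sketch, not a format taken from the benchmark.
    """
    if not passed_on_attempt:
        raise ValueError("no exercises to score")
    solved = sum(
        1
        for attempt in passed_on_attempt
        if attempt is not None and attempt <= max_attempts
    )
    return 100.0 * solved / len(passed_on_attempt)


# Made-up outcomes for eight exercises (not real benchmark data).
results = [1, 2, None, 1, None, 2, 1, 1]

print(f"within two attempts: {percent_correct(results):.1f}%")
print(f"first attempt only:  {percent_correct(results, max_attempts=1):.1f}%")
```

Because the denominator is the number of Aider Polyglot exercises, the result only has meaning relative to other runs on this same benchmark.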

No composite ranking
evals.report never combines benchmarks. % correct on Aider Polyglot is its own number — don’t average it with other metrics.