evals.report

Aider Polyglot

A coding benchmark that measures how reliably an LLM can solve problems and apply diff-based code edits across 225 challenging Exercism exercises in C++, Go, Java, JavaScript, Python, and Rust, with up to two attempts per problem.

Category: Coding
Metric: % correct
Higher is better
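To make the metric concrete, here is a minimal sketch of how a "% correct" score with up to two attempts per problem could be computed. It assumes an exercise counts as correct if any of its allowed attempts passes. The `ExerciseResult` type, the `percent_correct` function, and the exercise names are hypothetical illustrations, not part of the benchmark's actual harness.

```python
from dataclasses import dataclass


@dataclass
class ExerciseResult:
    """Hypothetical record: one exercise and the outcome of each attempt, in order."""
    name: str
    attempts_passed: list[bool]


def percent_correct(results: list[ExerciseResult], max_attempts: int = 2) -> float:
    """Percentage of exercises solved within max_attempts (assumed scoring rule)."""
    if not results:
        return 0.0
    # An exercise is counted once if any of its first max_attempts attempts passed.
    solved = sum(1 for r in results if any(r.attempts_passed[:max_attempts]))
    return 100.0 * solved / len(results)


# Made-up outcomes for illustration only; these are not real benchmark results.
results = [
    ExerciseResult("python/exercise-a", [True]),         # solved on the first attempt
    ExerciseResult("rust/exercise-b", [False, True]),    # solved on the second attempt
    ExerciseResult("go/exercise-c", [False, False]),     # not solved
    ExerciseResult("java/exercise-d", [False, False]),   # not solved
]

score = percent_correct(results)
print(f"{score:.1f}% correct")
```

Passing `max_attempts=1` to the same function gives the first-attempt score, which shows how much the second attempt contributes.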
