evals.report

Aider Polyglot

A coding benchmark that measures how reliably an LLM can solve problems and apply its changes as diff-based code edits. It covers 225 challenging Exercism exercises spanning C++, Go, Java, JavaScript, Python, and Rust, with up to two attempts per problem.
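To make the scoring rule concrete, here is a minimal sketch of how a "% correct" figure could be computed when each exercise allows up to two attempts. It assumes an exercise counts as solved if any allowed attempt passes; the `ExerciseResult` type, the helper names, and the sample data are hypothetical illustrations, not the benchmark's actual harness.

```python
from dataclasses import dataclass

# The description states "up to two attempts per problem".
MAX_ATTEMPTS = 2


@dataclass
class ExerciseResult:
    """Hypothetical record of one exercise run (not the real harness format)."""
    name: str
    language: str
    attempts_passed: list[bool]  # outcome of each attempt, in order


def solved(result: ExerciseResult) -> bool:
    # Assumption: an exercise is solved if any of the allowed attempts passes.
    return any(result.attempts_passed[:MAX_ATTEMPTS])


def percent_correct(results: list[ExerciseResult]) -> float:
    if not results:
        raise ValueError("no results to score")
    return 100.0 * sum(solved(r) for r in results) / len(results)


# Made-up sample data, for illustration only.
sample = [
    ExerciseResult("exercise-a", "python", [True]),
    ExerciseResult("exercise-b", "rust", [False, True]),
    ExerciseResult("exercise-c", "go", [False, False]),
    ExerciseResult("exercise-d", "java", [False, True]),
]

print(f"{percent_correct(sample):.1f}%")  # 75.0%
```

Under this assumption, a pass on the third or later attempt is ignored, so the score reflects only what the model achieves within the two-attempt budget.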

Category: Coding · Metric: % correct · Higher is better
| Model | Lab | Score | Source / Status | Date |
| --- | --- | --- | --- | --- |
| Claude Opus 4.5 | Anthropic | 89.4% | Verified | Nov 24, 2025 |
| GPT-5 | OpenAI | 88.0% | Official | Aug 7, 2025 |
| OpenAI o3-pro | OpenAI | 84.9% | Official | Jun 10, 2025 |
| Gemini 2.5 Pro | Google DeepMind | 83.1% | Official | Mar 25, 2025 |
| o3 | OpenAI | 81.3% | Official | Apr 16, 2025 |
| Grok 4 | xAI | 79.6% | Official | Jul 9, 2025 |
| DeepSeek V3.2 | DeepSeek | 74.2% | Official | Dec 1, 2025 |
| Claude Opus 4 | Anthropic | 72.0% | Official | May 22, 2025 |
| o4-mini | OpenAI | 72.0% | Official | Apr 16, 2025 |
| DeepSeek R1 | DeepSeek | 71.4% | Official | Jan 20, 2025 |
| DeepSeek V3.1 | DeepSeek | 68.4% | Unverified | Aug 21, 2025 |
| Claude 3.7 Sonnet | Anthropic | 64.9% | Official | Feb 24, 2025 |
| Gemini 2.5 Flash | Google DeepMind | 61.9% | Unverified | Apr 17, 2025 |
| Qwen 3 Coder 480B | Alibaba / Qwen | 61.8% | Unverified | Jul 22, 2025 |
| Claude Sonnet 4 | Anthropic | 61.3% | Official | May 22, 2025 |
| Kimi K2 Instruct | Moonshot AI | 60.0% | Unverified | Jul 11, 2025 |
| Qwen3 235B A22B Instruct 2507 | Alibaba / Qwen | 57.3% | Unverified | Jul 21, 2025 |
| DeepSeek V3 0324 | DeepSeek | 55.1% | Official | Mar 24, 2025 |
| GPT-4.1 | OpenAI | 52.4% | Official | Apr 14, 2025 |
| Claude 3.5 Sonnet | Anthropic | 51.6% | Official | Jun 20, 2024 |
| GPT-OSS-120B | OpenAI | 41.8% | Official | Aug 5, 2025 |
| GPT-4o | OpenAI | 23.1% | Official | May 13, 2024 |
| Gemini 2.0 Flash | Google DeepMind | 22.2% | Official | Dec 11, 2024 |
| Llama 4 Maverick | Meta | 15.6% | Official | Apr 5, 2025 |

Each row reports the model’s % correct on Aider Polyglot. Click a row for the full run context.