Aider Polyglot
A coding benchmark that measures how reliably an LLM can solve 225 challenging Exercism exercises by producing and applying diff-based code edits. The exercises span C++, Go, Java, JavaScript, Python, and Rust, and the model gets up to two attempts per problem.
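To make the "% correct" metric concrete, here is a minimal Python sketch of how a score could be computed under the two-attempt rule described above. It is an illustration only and not the benchmark's actual harness: the per-exercise result format (a list of pass/fail booleans, one per attempt) and the names `solved` and `percent_correct` are assumptions, and the example counts are made up.

```python
from typing import Sequence

MAX_ATTEMPTS = 2  # the description above allows up to two attempts per problem


def solved(attempts: Sequence[bool], max_attempts: int = MAX_ATTEMPTS) -> bool:
    """Return True if any of the first `max_attempts` attempts passed."""
    return any(attempts[:max_attempts])


def percent_correct(results: Sequence[Sequence[bool]]) -> float:
    """Share of exercises solved within the attempt limit, as a percentage."""
    if not results:
        raise ValueError("no exercise results given")
    return 100.0 * sum(solved(r) for r in results) / len(results)


# Hypothetical run over 225 exercises (invented counts, not a real model's results):
# 120 pass on the first attempt, 60 pass on the second, 45 fail both.
example = [[True]] * 120 + [[False, True]] * 60 + [[False, False]] * 45
print(f"{percent_correct(example):.1f}%")  # prints 80.0%
```

Under this reading, an exercise that passes only on a third attempt would not count, which is why `solved` slices the attempt list before checking it.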
Category: Coding. Metric: % correct. Higher is better.
| Model | Lab | Score (% correct) | Source model | Status | Date | Details |
|---|---|---|---|---|---|---|
| Claude Opus 4.5 | Anthropic | 89.4% | — | Verified | Nov 24, 2025 | Details |
| GPT-5 | OpenAI | 88.0% | — | Official | Aug 7, 2025 | Details |
| OpenAI o3-pro | OpenAI | 84.9% | — | Official | Jun 10, 2025 | Details |
| Gemini 2.5 Pro | Google DeepMind | 83.1% | — | Official | Mar 25, 2025 | Details |
| o3 | OpenAI | 81.3% | — | Official | Apr 16, 2025 | Details |
| Grok 4 | xAI | 79.6% | — | Official | Jul 9, 2025 | Details |
| DeepSeek V3.2 | DeepSeek | 74.2% | — | Official | Dec 1, 2025 | Details |
| Claude Opus 4 | Anthropic | 72.0% | — | Official | May 22, 2025 | Details |
| o4-mini | OpenAI | 72.0% | — | Official | Apr 16, 2025 | Details |
| DeepSeek R1 | DeepSeek | 71.4% | — | Official | Jan 20, 2025 | Details |
| DeepSeek V3.1 | DeepSeek | 68.4% | — | Unverified | Aug 21, 2025 | Details |
| Claude 3.7 Sonnet | Anthropic | 64.9% | — | Official | Feb 24, 2025 | Details |
| Gemini 2.5 Flash | Google DeepMind | 61.9% | — | Unverified | Apr 17, 2025 | Details |
| Qwen 3 Coder 480B | Alibaba / Qwen | 61.8% | — | Unverified | Jul 22, 2025 | Details |
| Claude Sonnet 4 | Anthropic | 61.3% | — | Official | May 22, 2025 | Details |
| Kimi K2 Instruct | Moonshot AI | 60.0% | — | Unverified | Jul 11, 2025 | Details |
| Qwen3 235B A22B Instruct 2507 | Alibaba / Qwen | 57.3% | — | Unverified | Jul 21, 2025 | Details |
| DeepSeek V3 0324 | DeepSeek | 55.1% | — | Official | Mar 24, 2025 | Details |
| GPT-4.1 | OpenAI | 52.4% | — | Official | Apr 14, 2025 | Details |
| Claude 3.5 Sonnet | Anthropic | 51.6% | — | Official | Jun 20, 2024 | Details |
| GPT-OSS-120B | OpenAI | 41.8% | — | Official | Aug 5, 2025 | Details |
| GPT-4o | OpenAI | 23.1% | — | Official | May 13, 2024 | Details |
| Gemini 2.0 Flash | Google DeepMind | 22.2% | — | Official | Dec 11, 2024 | Details |
| Llama 4 Maverick | Meta | 15.6% | — | Official | Apr 5, 2025 | Details |
Each row reports the model's % correct on Aider Polyglot.