BigCodeBench
BigCodeBench is a benchmark of function-level Python programming tasks, with 1,140 tasks in the Full set and 148 in the Hard subset. Each task requires the model to follow complex instructions and compose calls across 139 libraries. Scores are reported as calibrated Pass@1 with greedy decoding.
Category: Coding. Metric: calibrated Pass@1 (higher is better).
| Model | Lab | Score (calibrated Pass@1) | Source model | Status | Date |
|---|---|---|---|---|---|
| DeepSeek V3 | DeepSeek | 40.5% | — | Verified | Dec 26, 2024 |
| DeepSeek R1 | DeepSeek | 40.5% | — | Verified | Jan 20, 2025 |
| Gemini 2.5 Pro | Google DeepMind | 36.5% | — | Verified | Mar 25, 2025 |
| DeepSeek V3 0324 | DeepSeek | 35.8% | — | Verified | Mar 24, 2025 |
| Claude 3.5 Sonnet | Anthropic | 35.1% | — | Verified | Jun 20, 2024 |
| GPT-4o | OpenAI | 34.5% | — | Verified | May 13, 2024 |
| GPT-4.1 | OpenAI | 33.8% | — | Verified | Apr 14, 2025 |
| Claude 3.7 Sonnet | Anthropic | 33.8% | — | Verified | Feb 24, 2025 |
| Gemini 2.0 Flash | Google DeepMind | 33.8% | — | Verified | Dec 11, 2024 |
| Gemini 1.5 Pro | Google DeepMind | 32.4% | — | Verified | Feb 15, 2024 |
| Llama 3.1 405B | Meta | 30.4% | — | Verified | Jul 23, 2024 |
| Mistral Large | Mistral AI | 29.7% | — | Verified | Feb 26, 2024 |
| Llama 4 Maverick | Meta | 29.1% | — | Verified | Apr 5, 2025 |
| Llama 4 Scout | Meta | 16.9% | — | Verified | Apr 5, 2025 |
Each row reports the model’s calibrated Pass@1 on BigCodeBench.
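
As a rough illustration of how a Pass@1 percentage is derived under greedy decoding: each task gets exactly one generated solution, so the score reduces to the share of tasks whose solution passes all of its tests. The sketch below assumes per-task pass/fail outcomes are already available. The function name `pass_at_1`, the task IDs, and the outcome data are hypothetical, and the calibration step applied to model outputs before testing is not modeled here.

```python
def pass_at_1(results: dict[str, bool]) -> float:
    """Return the percentage of tasks whose single greedy sample passed.

    `results` maps a task ID to True if the one generated solution
    passed every test case for that task, and False otherwise.
    """
    if not results:
        raise ValueError("no task results")
    return 100.0 * sum(results.values()) / len(results)


# Hypothetical outcomes (not real benchmark data): 148 tasks,
# with the first 60 marked as passing.
results = {f"task_{i}": i < 60 for i in range(148)}

score = pass_at_1(results)
print(f"{score:.1f}%")  # prints "40.5%"
```

With one sample per task, no sampling-based estimator is needed; the metric is a plain pass rate.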