evals.report

BigCodeBench

A benchmark of 1,140 (Full) / 148 (Hard) function-level Python programming tasks requiring models to compose calls across 139 diverse libraries from complex instructions, scored by calibrated Pass@1 with greedy decoding.
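
As a rough illustration of the metric: with greedy decoding each task gets exactly one completion, so Pass@1 reduces to the fraction of tasks whose completion passes all of its test cases. The sketch below assumes that reading. The `pass_at_1` helper and the task IDs are hypothetical, and the calibration step is not modeled here.

```python
from typing import Mapping


def pass_at_1(results: Mapping[str, bool]) -> float:
    """Fraction of tasks whose single greedy completion passed all tests.

    `results` maps a task ID to True if the completion passed every test
    case for that task, and False otherwise.
    """
    if not results:
        raise ValueError("no task results supplied")
    return sum(results.values()) / len(results)


# Hypothetical per-task outcomes, for illustration only (not real benchmark data).
example = {"task_0": True, "task_1": False, "task_2": True, "task_3": False}
score = pass_at_1(example)
print(f"Pass@1: {score:.1%}")  # 2 of 4 tasks passed -> Pass@1: 50.0%
```

Because only one deterministic sample is drawn per task, no unbiased pass@k estimator is needed; a plain average over tasks is enough under this assumption.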

Coding | calibrated Pass@1 | Higher is better
| Model | Lab | Score | Source model | Status | Date |
|---|---|---|---|---|---|
| DeepSeek V3 | DeepSeek | 40.5% | | Verified | Dec 26, 2024 |
| DeepSeek R1 | DeepSeek | 40.5% | | Verified | Jan 20, 2025 |
| Gemini 2.5 Pro | Google DeepMind | 36.5% | | Verified | Mar 25, 2025 |
| DeepSeek V3 0324 | DeepSeek | 35.8% | | Verified | Mar 24, 2025 |
| Claude 3.5 Sonnet | Anthropic | 35.1% | | Verified | Jun 20, 2024 |
| GPT-4o | OpenAI | 34.5% | | Verified | May 13, 2024 |
| GPT-4.1 | OpenAI | 33.8% | | Verified | Apr 14, 2025 |
| Claude 3.7 Sonnet | Anthropic | 33.8% | | Verified | Feb 24, 2025 |
| Gemini 2.0 Flash | Google DeepMind | 33.8% | | Verified | Dec 11, 2024 |
| Gemini 1.5 Pro | Google DeepMind | 32.4% | | Verified | Feb 15, 2024 |
| Llama 3.1 405B | Meta | 30.4% | | Verified | Jul 23, 2024 |
| Mistral Large | Mistral AI | 29.7% | | Verified | Feb 26, 2024 |
| Llama 4 Maverick | Meta | 29.1% | | Verified | Apr 5, 2025 |
| Llama 4 Scout | Meta | 16.9% | | Verified | Apr 5, 2025 |

Each row reports the model’s calibrated Pass@1 on BigCodeBench.