BigCodeBench
A benchmark of function-level Python programming tasks, 1,140 in the Full set and 148 in the Hard subset, that require models to follow complex instructions and compose function calls across 139 libraries. Models are scored by calibrated Pass@1 with greedy decoding.
Category: Coding | Metric: calibrated Pass@1 | Higher is better
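With greedy decoding there is one generated solution per task, so Pass@1 reduces to the fraction of tasks whose single solution passes all of its tests. The sketch below illustrates only that arithmetic. The function name `pass_at_1`, the task IDs, and the pass/fail outcomes are hypothetical, and the calibration step applied by the benchmark is not modeled here.

```python
from typing import Dict


def pass_at_1(results: Dict[str, bool]) -> float:
    """Return the fraction of tasks whose single greedy sample passed all tests."""
    if not results:
        raise ValueError("no results to score")
    # True counts as 1 and False as 0, so the sum is the number of solved tasks.
    return sum(results.values()) / len(results)


# Hypothetical outcomes for four tasks (IDs and results are made up for illustration).
example = {"task_0": True, "task_1": False, "task_2": True, "task_3": True}

score = pass_at_1(example)
print(f"Pass@1 = {score:.2%}")  # prints "Pass@1 = 75.00%"
```

Because each task contributes exactly one sample, no sampling-based estimator is needed; the score is a plain average over tasks.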
No run guide for this benchmark yet.