evals.report

BigCodeBench

A benchmark of function-level Python programming tasks (1,140 in the Full set, 148 in the Hard subset) that require models to follow complex instructions and compose function calls across 139 diverse libraries. Models are scored by calibrated Pass@1 with greedy decoding.

Coding | calibrated Pass@1 | Higher is better
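As a rough illustration of the metric, the sketch below computes plain Pass@1 under greedy decoding: each task gets exactly one generated solution, and the score is the fraction of tasks whose solution passes all of its tests. The function name `pass_at_1`, the task IDs, and the outcomes are hypothetical and are not taken from the benchmark's own tooling. The calibration step is not modeled here.

```python
from typing import Mapping


def pass_at_1(results: Mapping[str, bool]) -> float:
    """Return the fraction of tasks whose single greedy sample passed.

    `results` maps a task ID to True if the one generated solution for
    that task passed every test case, and False otherwise.
    """
    if not results:
        raise ValueError("no task results given")
    # True counts as 1 and False as 0, so the sum is the number of passes.
    return sum(results.values()) / len(results)


# Hypothetical outcomes for four tasks (illustrative data only).
outcomes = {"task_0": True, "task_1": False, "task_2": True, "task_3": True}
score = pass_at_1(outcomes)
print(f"Pass@1 = {score:.1%}")  # prints "Pass@1 = 75.0%"
```

With greedy decoding there is only one sample per task, so no sampling-based estimator is needed and the score reduces to a simple pass rate.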

No run guide for this benchmark yet.