evals.report

BigCodeBench

A benchmark of 1,140 (Full) / 148 (Hard) function-level Python programming tasks requiring models to compose calls across 139 diverse libraries from complex instructions, scored by calibrated Pass@1 with greedy decoding.

Coding · calibrated Pass@1 · Higher is better

What this benchmark measures

BigCodeBench tests whether a model can write function-level Python code that follows complex instructions and composes calls across 139 diverse libraries. The Full set contains 1,140 tasks and the Hard subset contains 148. Scores are reported as calibrated Pass@1, with each model generating its answer by greedy decoding.

Rows on this page are sourced from public benchmark artifacts, leaderboard exports, or source-linked model reports. Each row keeps benchmark version, source model name, and available run details attached to the score.

The metric shown here is calibrated Pass@1. It should be interpreted within BigCodeBench, not compared as part of a site-wide ranking.
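As a rough illustration of how a Pass@1 figure is formed under greedy decoding, the sketch below assumes one completion per task and reports the percentage of tasks whose completion passed its tests. The `TaskResult` type, the task identifiers, and the sample outcomes are hypothetical and are not taken from BigCodeBench or from this site. The "calibrated" part of the metric is handled by the benchmark's own harness and is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str  # hypothetical identifier, for illustration only
    passed: bool  # True if the single greedy completion passed the task's tests


def pass_at_1(results: list[TaskResult]) -> float:
    """Return the percentage of tasks whose single completion passed."""
    if not results:
        raise ValueError("no results to score")
    return 100.0 * sum(r.passed for r in results) / len(results)


# Made-up outcomes for four tasks: three pass, one fails.
results = [
    TaskResult("task-0", True),
    TaskResult("task-1", False),
    TaskResult("task-2", True),
    TaskResult("task-3", True),
]
score = pass_at_1(results)
print(f"Pass@1: {score:.1f}")
```

Because greedy decoding yields a single deterministic completion per task, Pass@1 in this sketch reduces to a plain pass rate; no sampling-based estimator is needed.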

No composite ranking
evals.report never combines benchmarks. Calibrated Pass@1 on BigCodeBench is its own number; don’t average it with other metrics.