evals.report

Berkeley Function Calling Leaderboard

A function-calling and tool-use benchmark covering single-turn, multi-turn, live, and agentic scenarios.

Tool use | accuracy | Higher is better

What this benchmark measures


Rows on this page are sourced from public benchmark artifacts, leaderboard exports, or source-linked model reports. Each row keeps benchmark version, source model name, and available run details attached to the score.
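As an illustration of what "attached to the score" could look like, here is a minimal sketch of one row as a record. The class name `ScoreRow` and every field name are hypothetical, chosen only to mirror the fields listed above; the values are placeholders, not real results, and this is not the site's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class ScoreRow:
    """One leaderboard row; all field names are assumptions for illustration."""
    benchmark: str            # e.g. "Berkeley Function Calling Leaderboard"
    benchmark_version: str    # version string as published by the source
    source_model_name: str    # model name exactly as the source reports it
    metric: str               # "accuracy" for this benchmark
    score: float              # higher is better
    run_details: dict = field(default_factory=dict)  # whatever the source provides


# Placeholder values only; nothing here is a real result.
row = ScoreRow(
    benchmark="Berkeley Function Calling Leaderboard",
    benchmark_version="placeholder-version",
    source_model_name="example-model",
    metric="accuracy",
    score=0.0,
)
```

Keeping the version and source model name on the same record as the score means a number cannot be separated from the context needed to interpret it.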

The metric shown here is accuracy. Interpret it within the context of the Berkeley Function Calling Leaderboard and its UC Berkeley Gorilla source; do not compare it as part of a site-wide ranking.

What to be careful about

BFCL includes source-provided within-benchmark aggregates; label them as BFCL metrics, never as evals.report composites.

No composite ranking
evals.report never combines benchmarks. Accuracy on the Berkeley Function Calling Leaderboard is its own number; don't average it with other metrics.
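For readers working with exported rows, the rule above can be enforced with a small guard. The sketch below is one possible approach, not the site's implementation: the function name `rank_within_benchmark` and the tuple layout `(benchmark, metric, model, score)` are assumptions, and the scores in the usage example are made up.

```python
def rank_within_benchmark(rows):
    """Sort rows by score, highest first, refusing to mix benchmarks or metrics.

    Each row is a hypothetical (benchmark, metric, model, score) tuple.
    """
    # Collect every distinct (benchmark, metric) pair present in the input.
    contexts = {(benchmark, metric) for benchmark, metric, _, _ in rows}
    if len(contexts) > 1:
        raise ValueError(
            f"rows span {len(contexts)} benchmark/metric pairs; "
            "rank each one separately instead of combining them"
        )
    # Higher is better, so sort in descending order of score.
    return sorted(rows, key=lambda r: r[3], reverse=True)


# Made-up scores for illustration only.
bfcl = "Berkeley Function Calling Leaderboard"
ranked = rank_within_benchmark([
    (bfcl, "accuracy", "model-a", 0.25),
    (bfcl, "accuracy", "model-b", 0.75),
])
```

Passing rows from a second benchmark, or a second metric, raises `ValueError` rather than silently producing a combined ordering.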