evals.report

Berkeley Function Calling Leaderboard

A function-calling and tool-use benchmark covering single-turn, multi-turn, live, and agentic scenarios.

Tool use | accuracy | Higher is better

Known official sources: 1

Ready now | Result archive | Review needed | Run guide ready | Public data

Berkeley Function Calling Leaderboard

Strong public benchmark for function calling, multi-turn, live, and agentic tool categories.

Category: Tool use
Owner: UC Berkeley Gorilla
Data path: Use the latest dated result archive after matching it to the public leaderboard. Prefer category rows first.
Known caveat: BFCL includes source-provided within-benchmark aggregates; label them as BFCL metrics, never evals.report composites.
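
The data path and caveat above could be sketched roughly as follows. This is a minimal illustration, not the site's or BFCL's actual tooling: the archive filename pattern (a YYYY-MM-DD date embedded in the name), the `Row` fields, and the helper names `latest_archive`, `label`, and `ordered` are all assumptions made for the example. Matching the archive against the public leaderboard is left out, since it depends on how the leaderboard is read.

```python
import re
from dataclasses import dataclass
from datetime import date

# Assumption: archive names embed a YYYY-MM-DD date somewhere in the filename.
DATE_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def latest_archive(names):
    """Return the name carrying the most recent embedded date, or None."""
    dated = []
    for name in names:
        match = DATE_RE.search(name)
        if not match:
            continue  # undated files are ignored
        try:
            stamp = date(*(int(part) for part in match.groups()))
        except ValueError:
            continue  # looks like a date but is not a valid one
        dated.append((stamp, name))
    return max(dated)[1] if dated else None


@dataclass(frozen=True)
class Row:
    """Hypothetical shape for one result row taken from an archive."""
    model: str
    metric: str         # a category name or an aggregate name
    value: float
    is_aggregate: bool  # True for source-provided within-benchmark aggregates


def label(row):
    """Attribute every number to BFCL; aggregates are never shown as site composites."""
    kind = "BFCL aggregate" if row.is_aggregate else "BFCL category"
    return f"{kind}: {row.metric}"


def ordered(rows):
    """Category rows first, aggregates after; input order is kept within each group."""
    return sorted(rows, key=lambda row: row.is_aggregate)
```

With made-up placeholder inputs, `latest_archive(["results_2024-05-01.zip", "results_2024-11-20.zip", "notes.txt"])` returns `"results_2024-11-20.zip"`, and `ordered` places a category row ahead of an aggregate row regardless of input order. Keeping the aggregate flag on the row, rather than inferring it from the metric name, is one way to make the labeling rule hard to bypass.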