evals.report

Berkeley Function Calling Leaderboard

A strong public benchmark for function calling, covering multi-turn, live, and agentic tool-use categories.

Ready now · Result archive · Review needed · Run guide ready · Public data

Source detail

Score source

The harness writes score files and CSVs; the public, dated BFCL-Result archive contains the score and result JSON.

Run guide

Official BFCL README documents install, generation, evaluation, and score output.

How it can be used

Use the latest dated result archive, after checking that it matches the public leaderboard. Prefer category rows first.

Caveat

BFCL includes source-provided within-benchmark aggregates; label them as BFCL metrics, never as evals.report composites.
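
One way to apply the two rules above (category rows first, aggregates labeled as BFCL metrics) is sketched below. This is a minimal illustration, not the harness's own code: the column names ("Model", "Overall Acc", and so on), the model name, and the values in SAMPLE_CSV are hypothetical placeholders, and the real score CSV headers may differ. The helper names split_metrics and load_records are also made up for this example.

```python
import csv
import io

# Hypothetical column names; check the real BFCL score CSV headers before use.
AGGREGATE_COLUMNS = {"Overall Acc"}

# Placeholder data for illustration only, not real BFCL results.
SAMPLE_CSV = """Model,Overall Acc,Non-Live AST Acc,Multi Turn Acc
example-model-a,50.00%,80.00%,20.00%
"""


def split_metrics(row, aggregate_columns=AGGREGATE_COLUMNS, id_column="Model"):
    """Turn one CSV row into labeled metric records, category rows first."""
    records = []
    for name, value in row.items():
        if name == id_column:
            continue
        # Aggregates stay labeled as BFCL's own metrics, never as site composites.
        kind = "bfcl_aggregate" if name in aggregate_columns else "bfcl_category"
        records.append(
            {
                "model": row[id_column],
                "metric": name,
                "value": float(value.rstrip("%")),
                "kind": kind,
            }
        )
    # Stable sort: category records (key False) come before aggregates (key True).
    records.sort(key=lambda r: r["kind"] == "bfcl_aggregate")
    return records


def load_records(text):
    """Parse CSV text and return labeled records for every model row."""
    reader = csv.DictReader(io.StringIO(text))
    out = []
    for row in reader:
        out.extend(split_metrics(row))
    return out


if __name__ == "__main__":
    for rec in load_records(SAMPLE_CSV):
        print(rec["kind"], rec["metric"], rec["value"])
```

Keeping the "kind" label on every record means a downstream table can show category rows by default and still display the source-provided aggregate, clearly attributed to BFCL.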
