What does accuracy mean on Berkeley Function Calling Leaderboard?

Berkeley Function Calling Leaderboard (BFCL) reports accuracy as a percentage; higher is better. Scores are shown only within Berkeley Function Calling Leaderboard and are never averaged with other benchmarks.

What is the top reported Berkeley Function Calling Leaderboard score?

Claude Opus 4.5 has the top reported score on Berkeley Function Calling Leaderboard: 77.47% (accuracy).

Why do Berkeley Function Calling Leaderboard scores differ across runs?

Harness, scaffold, reasoning effort, and prompt setup change results, so two runs of the same model can differ. evals.report keeps each score with its run context so the differences stay visible.

Does evals.report rank models across benchmarks?

No. Berkeley Function Calling Leaderboard scores are shown within their own metric; evals.report never combines benchmarks into a composite ranking or a single "best model".

Berkeley Function Calling Leaderboard

A function-calling and tool-use benchmark covering single-turn, multi-turn, live, and agentic scenarios.

Tool use · accuracy · Higher is better

BFCL is run via the official `bfcl-eval` Python package (the berkeley-function-call-leaderboard subdirectory of ShishirPatil/gorilla). You install it, set BFCL_PROJECT_ROOT and API keys (or pick a vLLM/SGLang backend for local OSS models), run `bfcl generate` to produce model responses, then `bfcl evaluate` to grade them with AST and executable checks, producing per-category accuracy JSON files and CSV summaries. When reporting a score, keep attached the exact bfcl-eval version/gorilla commit, the model name (the `-FC` suffix denotes native function-calling mode vs prompt-based), and the test categories evaluated, since overall accuracy mixes single-turn, live, and multi-turn subsets.
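The `-FC` naming convention described above can be checked mechanically before a score is recorded. The following is a minimal sketch, not part of the harness: `calling_mode` is a hypothetical helper, and the bare model name in the loop is an assumed counterpart to the `-FC` entry. It looks only at the suffix, so SUPPORTED_MODELS.md remains the reference for what a given name actually selects.

```python
def calling_mode(model_name: str) -> str:
    """Classify a BFCL --model string by its suffix (hypothetical helper)."""
    if model_name.endswith("-FC"):
        return "native function calling"
    return "prompt-based"


# The second name is an assumed prompt-mode counterpart, used for illustration only.
for name in ("gpt-4o-2024-11-20-FC", "gpt-4o-2024-11-20"):
    print(f"{name}: {calling_mode(name)}")
```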

Benchmark: Berkeley Function Calling Leaderboard

Repository: github.com/ShishirPatil/gorilla/tree/main/berkeley-function-call-leaderboard

Dataset: huggingface.co/datasets/gorilla-llm/Berkeley-Function-Calling-Leaderboard

Metric: accuracy

1. Install

shell

conda create -n BFCL python=3.10

shell

conda activate BFCL

shell

pip install bfcl-eval

shell

# For local OSS models, install from source with a backend extra instead of the pip install above:
# git clone https://github.com/ShishirPatil/gorilla.git
# cd gorilla/berkeley-function-call-leaderboard
# pip install -e ".[oss_eval_vllm]"   # or ".[oss_eval_sglang]"; the quotes keep zsh from globbing the brackets

# Project directory used by the harness:
export BFCL_PROJECT_ROOT=/path/to/your/desired/project/directory

shell

cp bfcl_eval/.env.example .env   # then fill in API keys (OPENAI_API_KEY, ANTHROPIC_API_KEY, etc.)
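Before starting a run, it can help to confirm that the setup from this step is in place. The sketch below is a hypothetical pre-flight check, not something shipped with bfcl-eval. It assumes the `.env` file sits directly inside `BFCL_PROJECT_ROOT` and uses plain `KEY=value` lines; adjust both if your layout differs.

```python
import tempfile
from pathlib import Path
from typing import List, Mapping


def preflight(env: Mapping[str, str], required_keys: List[str]) -> List[str]:
    """Return a list of setup problems; an empty list means the check passed."""
    root = env.get("BFCL_PROJECT_ROOT")
    if not root:
        return ["BFCL_PROJECT_ROOT is not set"]
    env_file = Path(root) / ".env"  # assumption: .env lives in the project root
    if not env_file.is_file():
        return [f"{env_file} not found"]
    defined = {}
    for line in env_file.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        defined[key.strip()] = value.strip().strip("\"'")
    return [
        f"{key} is missing or empty in {env_file}"
        for key in required_keys
        if not defined.get(key)
    ]


if __name__ == "__main__":
    # Demo against a throwaway directory with placeholder values.
    with tempfile.TemporaryDirectory() as root:
        (Path(root) / ".env").write_text("OPENAI_API_KEY=sk-placeholder\nANTHROPIC_API_KEY=\n")
        print(preflight({"BFCL_PROJECT_ROOT": root}, ["OPENAI_API_KEY", "ANTHROPIC_API_KEY"]))
```

In real use you would pass `os.environ` as `env` and list only the keys for the providers you plan to call.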

2. Run evaluation

shell

bfcl generate --model MODEL_NAME --test-category TEST_CATEGORY --num-threads 1

shell

# Example with multiple models/categories:
bfcl generate --model claude-3-5-sonnet-20241022-FC,gpt-4o-2024-11-20-FC --test-category simple_python,parallel,live_multiple,multi_turn

shell

# For a local OSS model served via vLLM/SGLang:
bfcl generate --model MODEL_NAME --test-category TEST_CATEGORY --backend vllm --num-gpus 1 --gpu-memory-utilization 0.9
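When runs are scripted, building the argument lists in one place keeps the generate and evaluate steps aligned on the same models and categories. This is a rough sketch: `bfcl_commands` is a hypothetical helper, it uses only the `--model` and `--test-category` flags shown above, and it assumes `bfcl evaluate` accepts the same comma-separated values as `bfcl generate`.

```python
from typing import List, Sequence


def bfcl_commands(models: Sequence[str], categories: Sequence[str]) -> List[List[str]]:
    """Build argv lists for the generate step and the matching evaluate step."""
    common = [
        "--model", ",".join(models),
        "--test-category", ",".join(categories),
    ]
    return [["bfcl", "generate", *common], ["bfcl", "evaluate", *common]]


for argv in bfcl_commands(
    ["claude-3-5-sonnet-20241022-FC", "gpt-4o-2024-11-20-FC"],
    ["simple_python", "live_multiple"],
):
    print(" ".join(argv))
```

Each list can be passed to `subprocess.run(argv, check=True)`; nothing is executed in the sketch itself.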

3. Score output

shell

bfcl evaluate --model MODEL_NAME --test-category TEST_CATEGORY

shell

# To score only the subset present in the result file:
bfcl evaluate --model MODEL_NAME --test-category TEST_CATEGORY --partial-eval
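Because `--partial-eval` scores only what is in the result file, it is worth checking coverage before trusting the number. The sketch below assumes the result file holds one JSON object per line with an `"id"` field, which is an assumption about the file layout rather than a documented contract; the test IDs are hypothetical.

```python
import json
from typing import Iterable, Set


def missing_ids(expected_ids: Iterable[str], result_lines: Iterable[str]) -> Set[str]:
    """Return the expected test IDs that do not appear in the result lines."""
    seen = set()
    for line in result_lines:
        line = line.strip()
        if line:
            seen.add(json.loads(line)["id"])  # assumed field name
    return set(expected_ids) - seen


# Hypothetical IDs and result entries, for illustration only.
expected = ["simple_python_0", "simple_python_1", "simple_python_2"]
results = [
    '{"id": "simple_python_0", "result": "..."}',
    '{"id": "simple_python_2", "result": "..."}',
]
print(sorted(missing_ids(expected, results)))  # → ['simple_python_1']
```

A non-empty result means the generate run was incomplete for that category, so the partial score covers fewer cases than the full category.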

4. Expected output

`bfcl generate` writes model responses to result/MODEL_NAME/BFCL_v3_TEST_CATEGORY_result.json. `bfcl evaluate` writes per-category scores to score/MODEL_NAME/BFCL_v3_TEST_CATEGORY_score.json and produces four CSV summaries under ./score/ (data_overall.csv, data_live.csv, data_non_live.csv, data_multi_turn.csv), with accuracy as the headline metric. Do not mix accuracy across different BFCL versions (V1/V2/V3/V4) or different test-category subsets.
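The file-name pattern quoted above can be expressed as a small helper, which is handy when collecting outputs from several runs. This is a sketch only: `output_paths` is hypothetical, the paths are relative to the project directory, and the `BFCL_v3` prefix is taken from the pattern above and is likely to differ for other BFCL versions.

```python
from pathlib import PurePosixPath
from typing import Tuple


def output_paths(
    model_name: str,
    test_category: str,
    version_prefix: str = "BFCL_v3",  # from the pattern above; version dependent
) -> Tuple[PurePosixPath, PurePosixPath]:
    """Return (result file, score file) for one model and one test category."""
    stem = f"{version_prefix}_{test_category}"
    result = PurePosixPath("result") / model_name / f"{stem}_result.json"
    score = PurePosixPath("score") / model_name / f"{stem}_score.json"
    return result, score


result_path, score_path = output_paths("gpt-4o-2024-11-20-FC", "simple_python")
print(result_path)  # result/gpt-4o-2024-11-20-FC/BFCL_v3_simple_python_result.json
print(score_path)   # score/gpt-4o-2024-11-20-FC/BFCL_v3_simple_python_score.json
```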

5. Submit results

Local scores are produced entirely by the harness; no submission is required to reproduce a number. To appear on the public leaderboard at https://gorilla.cs.berkeley.edu/leaderboard.html you add your model to the harness and open a PR against ShishirPatil/gorilla (per SUPPORTED_MODELS.md). When reporting a score elsewhere, attach: the bfcl-eval version (or gorilla commit), the exact --model string (including any -FC suffix), the --test-category set evaluated, and the backend used for OSS models.
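One way to keep that context attached is to store each score as a small structured record. The following is a sketch under assumptions: the class and field names are invented for illustration, and the version string and accuracy are placeholders rather than real results.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class BfclRunRecord:
    """Context to report alongside a BFCL score (hypothetical schema)."""
    harness_version: str              # bfcl-eval version or gorilla commit
    model: str                        # exact --model string, including any -FC suffix
    test_categories: Tuple[str, ...]  # the --test-category set that was evaluated
    accuracy_pct: float
    backend: Optional[str] = None     # e.g. "vllm" for local OSS models; None for API models


record = BfclRunRecord(
    harness_version="<bfcl-eval version or gorilla commit>",  # placeholder
    model="gpt-4o-2024-11-20-FC",
    test_categories=("simple_python", "parallel"),
    accuracy_pct=0.0,  # placeholder, not a real score
)
print(json.dumps(asdict(record), indent=2))
```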

Gotchas

Install the correct package: `pip install bfcl-eval`, NOT the unrelated `bfcl` package on PyPI (the README explicitly warns: 'Be careful not to confuse with the unrelated bfcl project on PyPI!').

Model name suffix matters: a `-FC` suffix selects the model's native function-calling/tool API, while the bare name uses prompt-based function calling — they are scored as different entries (see SUPPORTED_MODELS.md).

The headline 'accuracy' aggregates very different subsets (single-turn AST, executable, live, irrelevance, multi-turn). Always report which --test-category was run. `--partial-eval` silently scores only the IDs present in the result file, so an incomplete generate run can yield a score that looks complete but covers only part of the category.

You must set BFCL_PROJECT_ROOT and provide a populated .env with the relevant API keys before `bfcl generate`; executable test categories also require working network/runtime for the called functions, and OSS models require the vLLM or SGLang extra plus GPUs.
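To see why the aggregation gotcha matters, compare two common ways of combining per-category results. The counts below are hypothetical and the sketch makes no claim about which weighting the BFCL harness applies; it only shows that the same per-category numbers can produce quite different headline figures.

```python
from typing import Dict, Tuple


def macro_and_pooled(counts: Dict[str, Tuple[int, int]]) -> Tuple[float, float]:
    """counts maps category -> (correct, total); returns (macro %, pooled %)."""
    per_category = [100.0 * correct / total for correct, total in counts.values()]
    macro = sum(per_category) / len(per_category)  # each category weighted equally
    pooled = (
        100.0
        * sum(correct for correct, _ in counts.values())
        / sum(total for _, total in counts.values())
    )  # each test case weighted equally
    return macro, pooled


# Hypothetical counts: a large, easy category and a smaller, hard one.
counts = {"simple_python": (360, 400), "multi_turn": (40, 200)}
macro, pooled = macro_and_pooled(counts)
print(f"macro={macro:.2f} pooled={pooled:.2f}")  # → macro=55.00 pooled=66.67
```

Two reports that quote "accuracy" from the same run can therefore disagree by several points unless the category set and the aggregation are both stated.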