How to run Berkeley Function Calling Leaderboard — benchmark guide

BFCL is run with the official `bfcl-eval` Python package, which lives in the berkeley-function-call-leaderboard subdirectory of ShishirPatil/gorilla. The workflow has four steps: install the package, set BFCL_PROJECT_ROOT and your API keys (or pick a vLLM/SGLang backend for local OSS models), run `bfcl generate` to produce model responses, then run `bfcl evaluate` to grade them with AST and executable checks. Evaluation produces per-category accuracy JSON files and CSV summaries. When reporting a score, state the exact bfcl-eval version or gorilla commit, the model name (a `-FC` suffix denotes native function-calling mode; the bare name is prompt-based), and the test categories evaluated, because overall accuracy mixes single-turn, live, and multi-turn subsets.

Benchmark: Berkeley Function Calling Leaderboard

Repository: github.com/ShishirPatil/gorilla/tree/main/berkeley-function-call-leaderboard

Dataset: huggingface.co/datasets/gorilla-llm/Berkeley-Function-Calling-Leaderboard

Metric: accuracy

1. Install

shell

conda create -n BFCL python=3.10

shell

conda activate BFCL

shell

pip install bfcl-eval

shell

# For local OSS models instead, install from source with a backend extra
# (the extra is quoted so shells such as zsh do not expand the brackets):
# git clone https://github.com/ShishirPatil/gorilla.git
# cd gorilla/berkeley-function-call-leaderboard && pip install -e ".[oss_eval_vllm]"   (or ".[oss_eval_sglang]")

# Applies to every install method:
export BFCL_PROJECT_ROOT=/path/to/your/desired/project/directory

shell

cp bfcl_eval/.env.example .env   # then fill in API keys (OPENAI_API_KEY, ANTHROPIC_API_KEY, etc.)
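
Before spending API credits on a generate run, it can help to confirm that the setup above is actually in place. The following is a minimal sketch, not part of the harness: `preflight` is a hypothetical helper, the location of the .env file is passed in by the caller, and the parser only understands plain KEY=VALUE lines.

```python
import os
from pathlib import Path


def preflight(env_file, required_keys, environ=None):
    """Return a list of problems found; an empty list means the setup looks ready.

    Hypothetical helper: it only checks what the install steps describe
    (BFCL_PROJECT_ROOT is set, and the .env file holds non-empty API keys).
    """
    environ = os.environ if environ is None else environ
    problems = []

    root = environ.get("BFCL_PROJECT_ROOT", "")
    if not root:
        problems.append("BFCL_PROJECT_ROOT is not set")
    elif not Path(root).is_dir():
        problems.append(f"BFCL_PROJECT_ROOT is not a directory: {root}")

    env_path = Path(env_file)
    if not env_path.is_file():
        problems.append(f"missing .env file: {env_path}")
        return problems

    # Parse simple KEY=VALUE lines; blank lines and comments are skipped.
    values = {}
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")

    for key in required_keys:
        if not values.get(key):
            problems.append(f"{key} is empty or missing in {env_path}")
    return problems
```

A call such as `preflight(".env", ["OPENAI_API_KEY"])` returns an empty list when everything checks out, and otherwise one message per problem.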

2. Run evaluation

shell

bfcl generate --model MODEL_NAME --test-category TEST_CATEGORY --num-threads 1

shell

# Example with multiple models/categories:
bfcl generate --model claude-3-5-sonnet-20241022-FC,gpt-4o-2024-11-20-FC --test-category simple_python,parallel,live_multiple,multi_turn

shell

# For a local OSS model served via vLLM/SGLang:
bfcl generate --model MODEL_NAME --test-category TEST_CATEGORY --backend vllm --num-gpus 1 --gpu-memory-utilization 0.9
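
When the same run has to be repeated for several models or categories, assembling the command in a script keeps the exact arguments on record. This is a sketch under the assumption that only the flags shown above are needed; `build_generate_cmd` is a hypothetical helper that builds the argument list and does not run anything.

```python
import shlex


def build_generate_cmd(models, categories, num_threads=None, backend=None,
                       num_gpus=None, gpu_memory_utilization=None):
    """Build the argv list for `bfcl generate` from the flags used in this guide."""
    if not models or not categories:
        raise ValueError("at least one model and one test category are required")

    # Multiple models and categories are passed as comma-separated lists.
    cmd = ["bfcl", "generate",
           "--model", ",".join(models),
           "--test-category", ",".join(categories)]
    if num_threads is not None:
        cmd += ["--num-threads", str(num_threads)]
    # The remaining flags only matter for local OSS models.
    if backend is not None:
        cmd += ["--backend", backend]
    if num_gpus is not None:
        cmd += ["--num-gpus", str(num_gpus)]
    if gpu_memory_utilization is not None:
        cmd += ["--gpu-memory-utilization", str(gpu_memory_utilization)]
    return cmd


cmd = build_generate_cmd(
    ["claude-3-5-sonnet-20241022-FC", "gpt-4o-2024-11-20-FC"],
    ["simple_python", "parallel", "live_multiple", "multi_turn"],
)
# shlex.join produces a copy-pasteable shell line for logs or reports.
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the printed command looks right.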

3. Score output

shell

bfcl evaluate --model MODEL_NAME --test-category TEST_CATEGORY

shell

# To score only the subset present in the result file:
bfcl evaluate --model MODEL_NAME --test-category TEST_CATEGORY --partial-eval

4. Expected output

`bfcl generate` writes model responses to result/MODEL_NAME/BFCL_v3_TEST_CATEGORY_result.json. `bfcl evaluate` writes per-category scores to score/MODEL_NAME/BFCL_v3_TEST_CATEGORY_score.json and four CSV summaries under ./score/ (data_overall.csv, data_live.csv, data_non_live.csv, data_multi_turn.csv). Accuracy is the headline metric. The BFCL_v3 prefix in the file names corresponds to the V3 dataset; other harness versions may use a different prefix. Do not compare or average accuracy across different BFCL versions (V1/V2/V3/V4) or across different test-category subsets.
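
The naming pattern above can be captured in two small functions, which is handy when a script needs to check that a run produced its files. This is a sketch: the function names are hypothetical, the `root` argument is whatever directory holds the result/ and score/ folders on your machine, and the `BFCL_v3` default prefix is taken from the paths in this guide and may differ for other versions.

```python
from pathlib import Path


def result_path(root, model, category, prefix="BFCL_v3"):
    """Path where `bfcl generate` is expected to write responses for one category."""
    return Path(root) / "result" / model / f"{prefix}_{category}_result.json"


def score_path(root, model, category, prefix="BFCL_v3"):
    """Path where `bfcl evaluate` is expected to write the score for one category."""
    return Path(root) / "score" / model / f"{prefix}_{category}_score.json"


def missing_outputs(root, model, categories, prefix="BFCL_v3"):
    """List the expected result and score files that do not exist yet."""
    expected = []
    for category in categories:
        expected.append(result_path(root, model, category, prefix))
        expected.append(score_path(root, model, category, prefix))
    return [path for path in expected if not path.is_file()]
```

Calling `missing_outputs(root, "MODEL_NAME", ["parallel"])` after both commands have finished should return an empty list if the layout matches the description above.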

5. Submit results

Local scores are produced entirely by the harness; no submission is required to reproduce a number. To appear on the public leaderboard at https://gorilla.cs.berkeley.edu/leaderboard.html, add your model to the harness and open a pull request against ShishirPatil/gorilla (per SUPPORTED_MODELS.md). When reporting a score elsewhere, attach the bfcl-eval version (or gorilla commit), the exact --model string (including any -FC suffix), the --test-category set evaluated, and, for OSS models, the backend used.
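
One way to keep those four details attached to a score is to write them to a small JSON file next to the results. The record below is a sketch; `RunManifest` and its field names are hypothetical and not something the harness reads or writes.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class RunManifest:
    """Provenance for one reported score (hypothetical record, not a BFCL format)."""
    harness_version: str           # bfcl-eval version or gorilla commit
    model: str                     # exact --model string, including any -FC suffix
    test_categories: List[str]     # the --test-category set that was evaluated
    backend: Optional[str] = None  # vllm or sglang for OSS models, else None

    def to_json(self):
        record = asdict(self)
        # Sort categories so two manifests for the same run compare equal.
        record["test_categories"] = sorted(self.test_categories)
        return json.dumps(record, indent=2, sort_keys=True)


manifest = RunManifest(
    harness_version="REPLACE_WITH_VERSION_OR_COMMIT",  # placeholder, fill in yourself
    model="MODEL_NAME",
    test_categories=["parallel", "multi_turn"],
)
print(manifest.to_json())
```

Leaving `backend` as `None` for API-hosted models makes it explicit that no local serving backend was involved.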

Gotchas

Install the correct package: `pip install bfcl-eval`, NOT the unrelated `bfcl` package on PyPI (the README explicitly warns: 'Be careful not to confuse with the unrelated bfcl project on PyPI!').

Model name suffix matters: a `-FC` suffix selects the model's native function-calling/tool API, while the bare name uses prompt-based function calling — they are scored as different entries (see SUPPORTED_MODELS.md).
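
Because the two modes are separate entries, a script that collects scores should keep them apart rather than merge them under one model name. A minimal sketch, assuming the only signal is the `-FC` suffix described above; `calling_mode` and `group_by_mode` are hypothetical helpers.

```python
def calling_mode(model_name):
    """Classify a --model string by the -FC suffix convention."""
    return "native function calling" if model_name.endswith("-FC") else "prompt-based"


def group_by_mode(model_names):
    """Group model strings by mode so the two kinds of entries are never mixed."""
    groups = {"native function calling": [], "prompt-based": []}
    for name in model_names:
        groups[calling_mode(name)].append(name)
    return groups
```

For example, `calling_mode("gpt-4o-2024-11-20-FC")` classifies the entry as native function calling, while the same string without the suffix is classified as prompt-based.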

The headline 'accuracy' aggregates very different subsets (single-turn AST, executable, live, irrelevance, multi-turn). Always report which --test-category was run. Note that `--partial-eval` silently scores only the IDs present in the result file, so an incomplete generate run can produce a number that looks complete but covers only part of the category.
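
A quick coverage check before trusting a `--partial-eval` score can catch an incomplete run. The sketch below rests on assumptions that should be verified against your own files: it treats the result file as one JSON object per line with an `id` field, and it expects you to supply the full list of test IDs for the category. Both function names are hypothetical.

```python
import json


def result_ids(path):
    """Collect the IDs in a result file (assumed: one JSON object per line, with "id")."""
    ids = set()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                ids.add(json.loads(line)["id"])
    return ids


def coverage(expected_ids, path):
    """Return (fraction of expected IDs present, sorted list of missing IDs)."""
    expected = set(expected_ids)
    if not expected:
        raise ValueError("expected_ids must not be empty")
    found = result_ids(path)
    missing = sorted(expected - found)
    return len(expected & found) / len(expected), missing
```

A fraction below 1.0 means the score covers only part of the category, and the returned list shows which IDs still need to be generated.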

You must set BFCL_PROJECT_ROOT and provide a populated .env with the relevant API keys before `bfcl generate`; executable test categories also require working network/runtime for the called functions, and OSS models require the vLLM or SGLang extra plus GPUs.