evals.report

Berkeley Function Calling Leaderboard

A function-calling and tool-use benchmark covering single-turn, multi-turn, live, and agentic scenarios.

Category: Tool use. Metric: accuracy (higher is better).

| Model | Lab | Score | Source model | Status | Date |
| --- | --- | --- | --- | --- | --- |
| Claude Opus 4.5 | Anthropic | 77.47% | Claude-Opus-4-5-20251101 (FC) | Official | Apr 12, 2026 |
| Claude Sonnet 4.5 | Anthropic | 73.24% | Claude-Sonnet-4-5-20250929 (FC) | Official | Apr 12, 2026 |
| Gemini 3 Pro | Google DeepMind | 72.51% | Gemini-3-Pro-Preview (Prompt) | Official | Apr 12, 2026 |
| GLM-4.6 | Z.ai | 72.38% | GLM-4.6 (FC thinking) | Official | Apr 12, 2026 |
| Grok 4.1 fast reasoning | xAI | 69.57% | Grok-4-1-fast-reasoning (FC) | Official | Apr 12, 2026 |
| Claude Haiku 4.5 | Anthropic | 68.70% | Claude-Haiku-4-5-20251001 (FC) | Official | Apr 12, 2026 |
| o3 | OpenAI | 63.05% | o3-2025-04-16 (Prompt) | Official | Apr 12, 2026 |
| Grok 4 | xAI | 62.97% | Grok-4-0709 (Prompt) | Official | Apr 12, 2026 |
| Kimi K2 Instruct | Moonshot AI | 59.06% | Moonshotai-Kimi-K2-Instruct (FC) | Official | Apr 12, 2026 |
| Command A Reasoning | Cohere | 57.06% | Command A Reasoning (FC) | Official | Apr 12, 2026 |
| DeepSeek V3.2 | DeepSeek | 56.73% | DeepSeek-V3.2-Exp (Prompt + Thinking) | Official | Apr 12, 2026 |
| Gemini 2.5 Flash | Google DeepMind | 56.24% | Gemini-2.5-Flash (FC) | Official | Apr 12, 2026 |
| GPT-5.2 | OpenAI | 55.87% | GPT-5.2-2025-12-11 (FC) | Official | Apr 12, 2026 |
| GPT-5 mini | OpenAI | 55.46% | GPT-5-mini-2025-08-07 (FC) | Official | Apr 12, 2026 |
| GPT-4.1 | OpenAI | 53.96% | GPT-4.1-2025-04-14 (FC) | Official | Apr 12, 2026 |
| o4-mini | OpenAI | 53.24% | o4-mini-2025-04-16 (FC) | Official | Apr 12, 2026 |
| Qwen3 235B A22B Instruct 2507 | Alibaba / Qwen | 52.15% | Qwen3-235B-A22B-Instruct-2507 (Prompt) | Official | Apr 12, 2026 |
| Mistral Large | Mistral AI | 38.37% | mistral-large-2411 (FC) | Official | Apr 12, 2026 |
| Llama 4 Maverick | Meta | 37.29% | Llama-4-Maverick-17B-128E-Instruct-FP8 (FC) | Official | Apr 12, 2026 |

Each row reports the model’s accuracy on the Berkeley Function Calling Leaderboard.