Berkeley Function Calling Leaderboard
A function-calling and tool-use benchmark covering single-turn, multi-turn, live, and agentic scenarios.
Category: Tool use. Metric: accuracy (higher is better).
| Model | Lab | Score | Source model | Status | Date |
|---|---|---|---|---|---|
| Claude Opus 4.5 | Anthropic | 77.47% | Claude-Opus-4-5-20251101 (FC) | Official | Apr 12, 2026 |
| Claude Sonnet 4.5 | Anthropic | 73.24% | Claude-Sonnet-4-5-20250929 (FC) | Official | Apr 12, 2026 |
| Gemini 3 Pro | Google DeepMind | 72.51% | Gemini-3-Pro-Preview (Prompt) | Official | Apr 12, 2026 |
| GLM-4.6 | Z.ai | 72.38% | GLM-4.6 (FC thinking) | Official | Apr 12, 2026 |
| Grok 4.1 fast reasoning | xAI | 69.57% | Grok-4-1-fast-reasoning (FC) | Official | Apr 12, 2026 |
| Claude Haiku 4.5 | Anthropic | 68.70% | Claude-Haiku-4-5-20251001 (FC) | Official | Apr 12, 2026 |
| o3 | OpenAI | 63.05% | o3-2025-04-16 (Prompt) | Official | Apr 12, 2026 |
| Grok 4 | xAI | 62.97% | Grok-4-0709 (Prompt) | Official | Apr 12, 2026 |
| Kimi K2 Instruct | Moonshot AI | 59.06% | Moonshotai-Kimi-K2-Instruct (FC) | Official | Apr 12, 2026 |
| Command A Reasoning | Cohere | 57.06% | Command A Reasoning (FC) | Official | Apr 12, 2026 |
| DeepSeek V3.2 | DeepSeek | 56.73% | DeepSeek-V3.2-Exp (Prompt + Thinking) | Official | Apr 12, 2026 |
| Gemini 2.5 Flash | Google DeepMind | 56.24% | Gemini-2.5-Flash (FC) | Official | Apr 12, 2026 |
| GPT-5.2 | OpenAI | 55.87% | GPT-5.2-2025-12-11 (FC) | Official | Apr 12, 2026 |
| GPT-5 mini | OpenAI | 55.46% | GPT-5-mini-2025-08-07 (FC) | Official | Apr 12, 2026 |
| GPT-4.1 | OpenAI | 53.96% | GPT-4.1-2025-04-14 (FC) | Official | Apr 12, 2026 |
| o4-mini | OpenAI | 53.24% | o4-mini-2025-04-16 (FC) | Official | Apr 12, 2026 |
| Qwen3 235B A22B Instruct 2507 | Alibaba / Qwen | 52.15% | Qwen3-235B-A22B-Instruct-2507 (Prompt) | Official | Apr 12, 2026 |
| Mistral Large | Mistral AI | 38.37% | mistral-large-2411 (FC) | Official | Apr 12, 2026 |
| Llama 4 Maverick | Meta | 37.29% | Llama-4-Maverick-17B-128E-Instruct-FP8 (FC) | Official | Apr 12, 2026 |
Each row reports the model's accuracy on the Berkeley Function Calling Leaderboard. Rows are sorted by score, highest first.