MCP Atlas
Scale AI's large-scale tool-use benchmark: 1,000 expert-written natural-language tasks over 36 real Model Context Protocol (MCP) servers and 220+ tools, requiring agents to discover and orchestrate multi-step tool calls; scored by pass rate via an LLM judge.
Category: Tool use. Metric: pass rate (higher is better).
| Model | Lab | Score (pass rate) | Source model | Status | Date |
|---|---|---|---|---|---|
| MiniMax M3 | MiniMax | 74.2% | MiniMax M3 | Verified | n/a |
| Claude Opus 4.5 | Anthropic | 62.3% | Claude Opus 4.5 | Official | n/a |
| Gemini 3 Pro | Google DeepMind | 54.1% | Gemini 3 Pro | Official | n/a |
| GPT-5 | OpenAI | 44.5% | GPT-5 | Official | n/a |
| Claude Sonnet 4.5 | Anthropic | 43.8% | Claude Sonnet 4.5 | Official | n/a |
| Claude Opus 4.1 | Anthropic | 40.9% | Claude Opus 4.1 | Official | n/a |
| Claude Sonnet 4 | Anthropic | 35.6% | Claude Sonnet 4 | Official | n/a |
| Kimi K2 Instruct | Moonshot AI | 23.9% | Kimi K2 Instruct | Official | n/a |
| Qwen3 235B A22B Instruct 2507 | Alibaba / Qwen | 12.0% | Qwen3-235B-A22B | Official | n/a |
| Gemini 2.5 Pro | Google DeepMind | 8.8% | Gemini 2.5 Pro | Official | n/a |
| GPT-4o | OpenAI | 7.2% | GPT-4o | Official | n/a |
| Gemini 2.5 Flash | Google DeepMind | 3.4% | Gemini 2.5 Flash | Official | n/a |
Each row reports the model’s pass rate on MCP Atlas.
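For readers unfamiliar with the metric, pass rate is the percentage of tasks judged as passed. The following is a minimal sketch, assuming the judge returns one boolean verdict per task. The function name `pass_rate` and the sample verdicts are hypothetical illustrations, not part of the MCP Atlas harness or its data.

```python
from typing import Sequence


def pass_rate(verdicts: Sequence[bool]) -> float:
    """Return the percentage of tasks marked as passed.

    Assumption: each entry is one task's pass/fail verdict from the judge.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    # True counts as 1 and False as 0, so sum() gives the number of passes.
    return 100.0 * sum(verdicts) / len(verdicts)


# Hypothetical verdicts for five tasks (illustrative only, not benchmark data).
example_verdicts = [True, False, True, True, False]
print(f"{pass_rate(example_verdicts):.1f}%")  # prints "60.0%"
```

Under this reading, a score such as 62.3% on a 1,000-task set would correspond to roughly 623 tasks judged as passed, assuming every task is weighted equally.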