evals.report

MCP Atlas

Scale AI's large-scale tool-use benchmark: 1,000 expert-written natural-language tasks over 36 real Model Context Protocol (MCP) servers and 220+ tools, requiring agents to discover and orchestrate multi-step tool calls; scored by pass rate via an LLM judge.

Category: Tool use. Metric: pass rate. Higher is better.
| Model | Lab | Score | Source model | Status |
|---|---|---|---|---|
| MiniMax M3 | MiniMax | 74.2% | MiniMax M3 | Verified |
| Claude Opus 4.5 | Anthropic | 62.3% | Claude Opus 4.5 | Official |
| Gemini 3 Pro | Google DeepMind | 54.1% | Gemini 3 Pro | Official |
| GPT-5 | OpenAI | 44.5% | GPT-5 | Official |
| Claude Sonnet 4.5 | Anthropic | 43.8% | Claude Sonnet 4.5 | Official |
| Claude Opus 4.1 | Anthropic | 40.9% | Claude Opus 4.1 | Official |
| Claude Sonnet 4 | Anthropic | 35.6% | Claude Sonnet 4 | Official |
| Kimi K2 Instruct | Moonshot AI | 23.9% | Kimi K2 Instruct | Official |
| Qwen3 235B A22B Instruct 2507 | Alibaba / Qwen | 12.0% | Qwen3-235B-A22B | Official |
| Gemini 2.5 Pro | Google DeepMind | 8.8% | Gemini 2.5 Pro | Official |
| GPT-4o | OpenAI | 7.2% | GPT-4o | Official |
| Gemini 2.5 Flash | Google DeepMind | 3.4% | Gemini 2.5 Flash | Official |

Each row reports the model’s pass rate on MCP Atlas.
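
For readers unfamiliar with the metric, here is a minimal sketch of how a pass rate can be computed from per-task verdicts. It assumes the LLM judge reduces each task to a single pass/fail boolean, which is a simplification made for illustration; the function name `pass_rate` and the verdict counts are hypothetical and do not come from the benchmark's own tooling or from any row in the table.

```python
def pass_rate(verdicts: list[bool]) -> float:
    """Return the percentage of tasks whose verdict is a pass."""
    if not verdicts:
        raise ValueError("no verdicts to score")
    # True counts as 1 and False as 0, so sum() gives the number of passes.
    return 100.0 * sum(verdicts) / len(verdicts)


# Hypothetical run over 1,000 tasks: 250 judged as passes, 750 as failures.
verdicts = [True] * 250 + [False] * 750
print(f"{pass_rate(verdicts):.1f}%")  # prints "25.0%"
```

Under this assumption, a score in the table is simply the share of the 1,000 tasks that the judge marked as passed.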