MCP Atlas
Scale AI's large-scale tool-use benchmark: 1,000 expert-written natural-language tasks over 36 real Model Context Protocol (MCP) servers and 220+ tools, requiring agents to discover and orchestrate multi-step tool calls; scored by pass rate via an LLM judge.
What this benchmark measures
Rows on this page are sourced from public benchmark artifacts, leaderboard exports, or source-linked model reports. Each row keeps benchmark version, source model name, and available run details attached to the score.
The metric shown here is pass rate. It should be interpreted within MCP Atlas, not compared as part of a site-wide ranking.
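As a rough illustration of the metric, the sketch below computes a pass rate as the fraction of tasks marked as passed. It assumes each task ends up with a single boolean verdict from the judge; the `TaskResult` type, its field names, and the sample data are hypothetical and are not taken from MCP Atlas artifacts. How the judge arrives at each verdict is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """One judged task. Field names are hypothetical, for illustration only."""
    task_id: str
    passed: bool  # assumed: the judge's verdict reduced to pass/fail


def pass_rate(results):
    """Return the fraction of tasks whose verdict is a pass."""
    if not results:
        raise ValueError("pass rate is undefined for an empty result set")
    return sum(1 for r in results if r.passed) / len(results)


# Made-up sample data: three of four tasks pass.
results = [
    TaskResult("task-1", True),
    TaskResult("task-2", False),
    TaskResult("task-3", True),
    TaskResult("task-4", True),
]

print(f"{pass_rate(results):.1%}")  # prints 75.0%
```

Because the denominator is the number of tasks in the evaluated set, a pass rate from one task set (for example a public subset) is not directly comparable to a pass rate from a different set.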
What to be careful about
Pass rate is determined by an LLM judge (Gemini 2.5 Pro by default), not by exact-match checking. MiniMax's number comes from a self-reported run on the Public Set and is distinct from Scale's official leaderboard.