evals.report

MCP Atlas

Tool use | pass rate | Higher is better

What this benchmark measures

Scale AI's large-scale tool-use benchmark: 1,000 expert-written natural-language tasks over 36 real Model Context Protocol (MCP) servers and 220+ tools, requiring agents to discover and orchestrate multi-step tool calls; scored by pass rate via an LLM judge.

Rows on this page are sourced from public benchmark artifacts, leaderboard exports, or source-linked model reports. Each row keeps benchmark version, source model name, and available run details attached to the score.

The metric shown here is pass rate. It should be interpreted within MCP Atlas, not compared as part of a site-wide ranking.
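
As a rough illustration of the metric, pass rate is the fraction of tasks the judge marks as passed. The sketch below assumes each task ends with a single pass/fail verdict; the `TaskResult` type and `pass_rate` function are hypothetical names used for illustration and are not part of MCP Atlas or its harness.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """One judged task. Hypothetical structure, for illustration only."""
    task_id: str
    passed: bool  # assumed: the LLM judge's final pass/fail verdict


def pass_rate(results: list[TaskResult]) -> float:
    """Return the fraction of tasks marked as passed."""
    if not results:
        raise ValueError("pass rate is undefined for an empty result set")
    return sum(r.passed for r in results) / len(results)


# Example: 3 of 4 tasks pass, so the pass rate is 0.75.
example = [
    TaskResult("t1", True),
    TaskResult("t2", False),
    TaskResult("t3", True),
    TaskResult("t4", True),
]
print(pass_rate(example))
```

Because the result is a proportion over one task set, it is only comparable between runs that use the same tasks and the same judge.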

What to be careful about

Pass rate is determined by an LLM judge (Gemini 2.5 Pro by default). MiniMax's number comes from a self-reported Public Set run and is distinct from Scale's official leaderboard.

No composite ranking
evals.report never combines benchmarks. Pass rate on MCP Atlas is its own number; don't average it with other metrics.