evals.report

MCP Atlas

Scale AI's large-scale tool-use benchmark: 1,000 expert-written natural-language tasks over 36 real Model Context Protocol (MCP) servers and 220+ tools, requiring agents to discover and orchestrate multi-step tool calls; scored by pass rate via an LLM judge.

Category: Tool use | Metric: pass rate | Higher is better
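
As a rough illustration of the pass-rate metric, the sketch below assumes each task ends with a single boolean verdict from the judge. That verdict format, the `pass_rate` helper, and the sample data are all hypothetical and are not taken from the benchmark's own harness.

```python
from typing import Iterable


def pass_rate(verdicts: Iterable[bool]) -> float:
    """Return the fraction of tasks marked as passed.

    Assumes one boolean verdict per task (an assumption for this sketch;
    the real judge output may be richer).
    """
    results = list(verdicts)
    if not results:
        raise ValueError("no verdicts to score")
    # True counts as 1 and False as 0, so the sum is the number of passes.
    return sum(results) / len(results)


# Hypothetical verdicts for five tasks (illustrative only, not real results).
example_verdicts = [True, False, True, True, False]
example_rate = pass_rate(example_verdicts)
```

Since higher is better, a model that passes more tasks under the judge gets a larger value, bounded between 0 and 1.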

No run guide for this benchmark yet.