What does pass rate mean on MCP Atlas?

MCP Atlas reports pass rate (%); higher is better. Scores are shown only within MCP Atlas and are never averaged with other benchmarks.

What is the top reported MCP Atlas score?

Muse Spark 1.1 has the top reported score on MCP Atlas: 88.1% (pass rate).

Why do MCP Atlas scores differ across runs?

Harness, scaffold, reasoning effort, and prompt setup change results, so two runs of the same model can differ. evals.report keeps each score with its run context so the differences stay visible.

Does evals.report rank models across benchmarks?

No. MCP Atlas scores are shown within their own metric; evals.report never combines benchmarks into a composite ranking or a single "best model".


MCP Atlas

Scale AI's large-scale tool-use benchmark: 1,000 expert-written natural-language tasks over 36 real Model Context Protocol (MCP) servers and 220+ tools, requiring agents to discover and orchestrate multi-step tool calls; scored by pass rate via an LLM judge.

Tool use | pass rate | Higher is better


MCP Atlas runs a tool-use task suite against Dockerized MCP servers (36 servers; the README lists 307 tools, while about 220 are advertised) and scores model trajectories with an LLM-as-judge (gemini/gemini-2.5-pro by default) over claims-based rubrics, producing a pass rate. The workflow has four stages: bring up the MCP server environment in Docker (port 1984); start the LiteLLM-backed completion service that drives the agentic loop for your model (port 3000); generate completions over the 500-task public dataset; and run the scoring script to produce coverage scores and a pass rate. Keep the following attached to any score: the model id; which MCP servers were online (only ~18% of tasks run with the 20 no-key default servers); the judge/evaluator model; the --pass-threshold used; the dataset split (the public 500-sample subset, not the full internal benchmark); and the Docker image tag or harness commit.

Benchmark: MCP Atlas
Repository: github.com/scaleapi/mcp-atlas
Dataset: huggingface.co/datasets/ScaleAI/MCP-Atlas
Metric: pass rate

1. Install

shell

git clone https://github.com/scaleapi/mcp-atlas.git

shell

cd mcp-atlas

shell

cp env.template .env

shell

# Edit .env: set LLM_API_KEY (model under eval), EVAL_LLM_API_KEY (judge); optional LLM_BASE_URL / EVAL_LLM_BASE_URL; EVAL_LLM_MODEL defaults to gemini/gemini-2.5-pro
# Prerequisites: Docker (allocate >=8GB RAM, 10GB+ recommended), uv, jq, Python 3.10+
docker pull ghcr.io/scaleapi/mcp-atlas:1.2.5

shell

docker tag ghcr.io/scaleapi/mcp-atlas:1.2.5 agent-environment:latest

2. Run evaluation

shell

# Terminal 1: start the MCP server environment (port 1984). Alternatively build from source: make build && make run-docker
make run-docker

shell

# Wait for log: 'Uvicorn running on http://0.0.0.0:1984', then verify servers are online (expect total:20,online:20):
curl -s http://localhost:1984/enabled-servers | jq -c

shell

# Terminal 2: start the completion service (port 3000), which manages the agentic LLM<->tools loop
make run-mcp-completion

shell

# Terminal 3: generate completions over the full public dataset (run from services/mcp_eval)
cd services/mcp_eval

shell

uv run python mcp_completion_script.py --model "openai/gpt-5.1" --input_huggingface "ScaleAI/MCP-Atlas" --output "mcp_eval_51_results.csv"

shell

# (Quick smoke test instead: uv run python mcp_completion_script.py --model "openai/gpt-5.1" --input "sample_tasks.csv" --output "sample_51_results.csv")

3. Score output

shell

# Run from services/mcp_eval; scores trajectories with the LLM judge and computes pass rate
uv run mcp_evals_scores.py --input-file="completion_results/mcp_eval_51_results.csv" --model-label="gpt51"

4. Expected output

The completion run writes a results CSV (e.g. mcp_eval_51_results.csv) under completion_results/ containing both ground truth and model trajectory/completion data. Scoring writes to evaluation_results/: scored_{label}.csv (per-task coverage scores), coverage_stats_{label}.csv (summary statistics, including the headline pass rate = fraction of tasks with coverage_score >= --pass-threshold, default 0.75), and coverage_histogram_{label}.png (score distribution). The pass rate is benchmark-local: do not compare it against tool-use scores from other suites, and note it shifts with how many MCP servers were online and the chosen pass threshold.
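The pass-rate definition above (fraction of tasks with coverage_score >= the pass threshold) can be sketched in a few lines of Python. This is a minimal illustration, not the harness's own code: it assumes the scored CSV exposes a `coverage_score` column as described, and the `task_id` column and sample values are made up for the demo.

```python
import csv
import io
from typing import Iterable, TextIO


def read_coverage_scores(handle: TextIO, column: str = "coverage_score") -> list[float]:
    """Read one coverage score per task from an open scored CSV.

    The column name is taken from the description above; adjust it if the
    real scored_{label}.csv uses a different header.
    """
    reader = csv.DictReader(handle)
    return [float(row[column]) for row in reader if row.get(column) not in (None, "")]


def pass_rate(scores: Iterable[float], threshold: float = 0.75) -> float:
    """Fraction of tasks whose coverage score meets or exceeds the threshold."""
    scores = list(scores)
    if not scores:
        raise ValueError("no scored tasks to aggregate")
    passed = sum(1 for score in scores if score >= threshold)
    return passed / len(scores)


# Illustrative values only, not real benchmark output.
sample = io.StringIO("task_id,coverage_score\na,1.0\nb,0.75\nc,0.5\nd,0.0\n")
scores = read_coverage_scores(sample)
print(f"pass rate: {pass_rate(scores):.1%}")  # prints "pass rate: 50.0%"
```

Recomputing the rate at a second threshold from the same scored file is a cheap way to see how sensitive a reported number is to --pass-threshold.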

5. Submit results

Scores are computed locally; the public leaderboard at https://scale.com/leaderboard/mcp_atlas is maintained by Scale AI (no documented self-serve submission). When reporting a number, attach: model id and provider/base URL, the Docker image tag (e.g. ghcr.io/scaleapi/mcp-atlas:1.2.5) or harness commit, which MCP servers were online (run curl http://localhost:1984/enabled-servers; with no API keys only the 20 default servers run, covering ~18% of tasks), the evaluator/judge model (default gemini/gemini-2.5-pro), --pass-threshold used, --concurrency, and that the run used the public 500-sample subset of ScaleAI/MCP-Atlas (not the full internal benchmark).

Gotchas

Coverage depends on API keys: only ~18% of the 500 tasks are runnable with the 20 default no-key servers. Adding keys for servers such as exa, airtable, mongodb, oxylabs, and brave-search materially changes the achievable pass rate, so a score is only comparable against runs with the same server set online. Five servers (Airtable, Google Calendar/google-workspace, Notion, MongoDB, Slack) also need sample-data uploads per data_exports/README.md; without them, tasks return erroneous results.

The HF dataset is a public 500-sample subset, not the full MCP Atlas benchmark, so locally computed pass rates are not directly comparable to the official leaderboard numbers.

Two long-running services must be up before evaluating: the Docker MCP environment on port 1984 (takes 1+ min; wait for the Uvicorn message and confirm total:20,online:20 via /enabled-servers) and the completion service on port 3000 (make run-mcp-completion). Docker needs >=8GB RAM. If you change LLM_API_KEY, restart make run-mcp-completion; after adding server keys, restart the Docker container and re-run uv run test_servers.py.
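The readiness check can be automated rather than eyeballed. The sketch below polls the /enabled-servers endpoint until every expected server is online. It assumes the response is a JSON object with `total` and `online` counts, inferred from the `total:20,online:20` output mentioned above; verify the shape against your own curl output before relying on it.

```python
import json
import time
import urllib.request

ENABLED_SERVERS_URL = "http://localhost:1984/enabled-servers"


def servers_ready(payload: object, expected: int = 20) -> bool:
    """True when the payload reports `expected` servers, all of them online.

    The `total` and `online` keys are an assumption about the response shape.
    """
    if not isinstance(payload, dict):
        return False
    return payload.get("total") == expected and payload.get("online") == expected


def wait_for_servers(
    url: str = ENABLED_SERVERS_URL,
    expected: int = 20,
    timeout_s: float = 180.0,
    poll_s: float = 5.0,
) -> bool:
    """Poll the endpoint until the servers are ready or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                if servers_ready(json.load(response), expected):
                    return True
        except (OSError, ValueError):
            # Container still starting, connection refused, or non-JSON body.
            pass
        time.sleep(poll_s)
    return False
```

Calling `wait_for_servers()` before starting the completion run, and passing a larger `expected` once extra server keys are configured, avoids generating completions against a half-started environment.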

Flag names are exact and easy to mix up: local file input is --input, HF dataset input is --input_huggingface (underscore, not dash); scoring uses --input-file and --model-label (dashes). The scoring script is mcp_evals_scores.py (plural 'evals'); steps 5-8 (including completion and scoring) are run from services/mcp_eval. Completion auto-skips tasks already present in the output CSV, so delete/rename it to force a full re-run.
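If you script the runs, building the argument lists in one place keeps the underscore and dash spellings from drifting. This is a minimal sketch: the helper names `completion_cmd` and `scoring_cmd` are hypothetical, while the script names and flags are the ones quoted above. Both commands are meant to be run from services/mcp_eval.

```python
from __future__ import annotations

import shlex


def completion_cmd(
    model: str,
    output: str,
    *,
    hf_dataset: str | None = None,
    local_csv: str | None = None,
) -> list[str]:
    """Argument list for the completion script; exactly one input source is required."""
    if (hf_dataset is None) == (local_csv is None):
        raise ValueError("pass exactly one of hf_dataset or local_csv")
    if hf_dataset is not None:
        source = ["--input_huggingface", hf_dataset]  # underscore, not dash
    else:
        source = ["--input", local_csv]
    return [
        "uv", "run", "python", "mcp_completion_script.py",
        "--model", model, *source, "--output", output,
    ]


def scoring_cmd(input_file: str, model_label: str) -> list[str]:
    """Argument list for the scoring script (dashed flags, plural 'evals')."""
    return [
        "uv", "run", "mcp_evals_scores.py",
        f"--input-file={input_file}",
        f"--model-label={model_label}",
    ]


print(shlex.join(completion_cmd(
    "openai/gpt-5.1", "mcp_eval_51_results.csv", hf_dataset="ScaleAI/MCP-Atlas",
)))
print(shlex.join(scoring_cmd("completion_results/mcp_eval_51_results.csv", "gpt51")))
```

The returned lists can be passed directly to `subprocess.run(..., cwd="services/mcp_eval", check=True)`, which sidesteps shell quoting altogether.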