Run MCP Atlas
The same run guide is also available from the benchmark detail page.
MCP Atlas runs a tool-use task suite against Dockerized MCP servers (36 servers and 307 tools according to the README, versus the ~220 tools advertised) and scores model trajectories with an LLM-as-judge (gemini/gemini-2.5-pro by default) over claims-based rubrics; the headline metric is a pass rate. The workflow has four stages: bring up the MCP server environment in Docker (port 1984), start a LiteLLM-backed completion service that drives the agentic loop for your model (port 3000), generate completions over the 500-task public dataset, and run the scoring script to produce coverage scores and a pass rate.
Keep the following attached to any score: the model id; which MCP servers were online (the 20 default servers that need no API keys cover only ~18% of tasks); the judge/evaluator model; the --pass-threshold used; the dataset split (the public 500-sample subset, not the full internal benchmark); and the Docker image tag or harness commit.
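To see why the set of online servers matters so much for the final number, here is a minimal sketch of how task coverage could be estimated. It assumes a task is runnable only when every server it needs is online. The task-to-server mapping, the function name `runnable_fraction`, and all task and server names are hypothetical, since this guide does not describe the dataset schema.

```python
from typing import Iterable, Mapping, Set


def runnable_fraction(
    task_servers: Mapping[str, Set[str]],
    online_servers: Iterable[str],
) -> float:
    """Return the fraction of tasks whose required servers are all online."""
    online = set(online_servers)
    if not task_servers:
        return 0.0
    # A task counts as runnable only if its required set is a subset of `online`.
    runnable = sum(1 for needed in task_servers.values() if needed <= online)
    return runnable / len(task_servers)


# Toy data only: these task ids and server names are made up for illustration.
toy_tasks = {
    "task-1": {"server_a"},
    "task-2": {"server_a", "server_b"},
    "task-3": {"server_c"},
    "task-4": {"server_b", "server_c"},
}
print(runnable_fraction(toy_tasks, ["server_a", "server_b"]))  # 0.5
```

With only `server_a` and `server_b` online, half of the toy tasks can run; bringing `server_c` online raises that to all four. The same effect is why a pass rate should always be reported together with the list of online servers.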
1. Install
git clone https://github.com/scaleapi/mcp-atlas.git
cd mcp-atlas
cp env.template .env
# Edit .env: set LLM_API_KEY (model under eval), EVAL_LLM_API_KEY (judge); optional LLM_BASE_URL / EVAL_LLM_BASE_URL; EVAL_LLM_MODEL defaults to gemini/gemini-2.5-pro
# Prerequisites: Docker (allocate >=8GB RAM, 10GB+ recommended), uv, jq, Python 3.10+
docker pull ghcr.io/scaleapi/mcp-atlas:1.2.5
docker tag ghcr.io/scaleapi/mcp-atlas:1.2.5 agent-environment:latest
2. Run evaluation
# Terminal 1: start the MCP server environment (port 1984). Alternatively build from source: make build && make run-docker
make run-docker
# Wait for log: 'Uvicorn running on http://0.0.0.0:1984', then verify servers are online (expect total:20,online:20):
curl -s http://localhost:1984/enabled-servers | jq -c
# Terminal 2: start the completion service (port 3000), which manages the agentic LLM<->tools loop
make run-mcp-completion
# Terminal 3: generate completions over the full public dataset (run from services/mcp_eval)
cd services/mcp_eval
uv run python mcp_completion_script.py --model "openai/gpt-5.1" --input_huggingface "ScaleAI/MCP-Atlas" --output "mcp_eval_51_results.csv"
# (Quick smoke test instead: uv run python mcp_completion_script.py --model "openai/gpt-5.1" --input "sample_tasks.csv" --output "sample_51_results.csv")
3. Score output
# Run from services/mcp_eval; scores trajectories with the LLM judge and computes pass rate
uv run mcp_evals_scores.py --input-file="completion_results/mcp_eval_51_results.csv" --model-label="gpt51"
4. Expected output
The completion run writes a results CSV (e.g. mcp_eval_51_results.csv) under completion_results/ containing both ground truth and model trajectory/completion data. Scoring writes to evaluation_results/: scored_{label}.csv (per-task coverage scores), coverage_stats_{label}.csv (summary statistics, including the headline pass rate = fraction of tasks with coverage_score >= --pass-threshold, default 0.75), and coverage_histogram_{label}.png (score distribution). The pass rate is benchmark-local: do not compare it against tool-use scores from other suites, and note it shifts with how many MCP servers were online and the chosen pass threshold.
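The pass-rate definition above (fraction of tasks with coverage_score >= --pass-threshold) can be sketched in a few lines, which is handy for rechecking a reported number at a different threshold. This is an illustrative sketch, not the harness's own code: the function names are hypothetical, and reading a `coverage_score` column from the scored CSV is an assumption about that file's layout.

```python
import csv
from typing import List, Sequence


def pass_rate(coverage_scores: Sequence[float], pass_threshold: float = 0.75) -> float:
    """Fraction of tasks whose coverage score meets or exceeds the threshold."""
    if not coverage_scores:
        raise ValueError("no scores to aggregate")
    passed = sum(1 for score in coverage_scores if score >= pass_threshold)
    return passed / len(coverage_scores)


def load_scores(path: str, column: str = "coverage_score") -> List[float]:
    """Read one numeric column from a CSV file.

    The column name is an assumption; check the header of your scored CSV.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        return [float(row[column]) for row in csv.DictReader(handle)]


# Toy scores only, not real benchmark results.
toy_scores = [1.0, 0.8, 0.75, 0.5, 0.0]
print(pass_rate(toy_scores))       # 0.6 (three of five scores are >= 0.75)
print(pass_rate(toy_scores, 0.8))  # 0.4 (only two scores are >= 0.8)
```

The same five scores give 0.6 or 0.4 depending on the threshold, which is why the threshold belongs next to every reported pass rate.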
5. Submit results
Scores are computed locally; the public leaderboard at https://scale.com/leaderboard/mcp_atlas is maintained by Scale AI (no documented self-serve submission). When reporting a number, attach: model id and provider/base URL, the Docker image tag (e.g. ghcr.io/scaleapi/mcp-atlas:1.2.5) or harness commit, which MCP servers were online (run curl http://localhost:1984/enabled-servers; with no API keys only the 20 default servers run, covering ~18% of tasks), the evaluator/judge model (default gemini/gemini-2.5-pro), --pass-threshold used, --concurrency, and that the run used the public 500-sample subset of ScaleAI/MCP-Atlas (not the full internal benchmark).
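One way to keep these details attached to a score is to write them into a small JSON record next to the results. The sketch below is only a suggestion: the `RunRecord` class and its field names are hypothetical and not part of the MCP Atlas harness, and the values shown are the defaults and examples quoted in this guide, with the result fields left empty.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class RunRecord:
    """Hypothetical container for the context that should travel with a score."""

    model_id: str
    judge_model: str
    pass_threshold: float
    image_tag: str
    dataset: str = "ScaleAI/MCP-Atlas (public 500-sample subset)"
    base_url: Optional[str] = None
    concurrency: Optional[int] = None
    pass_rate: Optional[float] = None  # fill in from your own scoring run
    online_servers: List[str] = field(default_factory=list)


record = RunRecord(
    model_id="openai/gpt-5.1",
    judge_model="gemini/gemini-2.5-pro",
    pass_threshold=0.75,
    image_tag="ghcr.io/scaleapi/mcp-atlas:1.2.5",
)

# Serialize with sorted keys so records from different runs diff cleanly.
print(json.dumps(asdict(record), indent=2, sort_keys=True))
```

Populate `online_servers` from the /enabled-servers response of the run in question, and `pass_rate` from that run's summary statistics.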