evals.report

MCP Atlas runs a tool-use task suite against Dockerized MCP servers (36 servers; the README counts 307 tools, while roughly 220 are advertised) and scores model trajectories with an LLM judge (gemini/gemini-2.5-pro by default) against claims-based rubrics. The workflow has four stages: bring up the MCP server environment in Docker (port 1984), start the LiteLLM-backed completion service that drives the agentic loop for your model (port 3000), generate completions over the 500-task public dataset, and run the scoring script to produce coverage scores and a pass rate. Attach the following to any score you report: the model id, which MCP servers were online (only ~18% of tasks run with the 20 no-key default servers), the judge/evaluator model, the --pass-threshold used, the dataset split (public 500-sample subset, not the full internal benchmark), and the Docker image tag or harness commit.

Benchmark: MCP Atlas
Dataset: huggingface.co/datasets/ScaleAI/MCP-Atlas
Metric: pass rate

1. Install

shell
git clone https://github.com/scaleapi/mcp-atlas.git
shell
cd mcp-atlas
shell
cp env.template .env
shell
# Edit .env: set LLM_API_KEY (model under eval), EVAL_LLM_API_KEY (judge); optional LLM_BASE_URL / EVAL_LLM_BASE_URL; EVAL_LLM_MODEL defaults to gemini/gemini-2.5-pro
# Prerequisites: Docker (allocate >=8GB RAM, 10GB+ recommended), uv, jq, Python 3.10+
docker pull ghcr.io/scaleapi/mcp-atlas:1.2.5
shell
docker tag ghcr.io/scaleapi/mcp-atlas:1.2.5 agent-environment:latest
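Before starting the services, it can help to confirm that the two required keys named in the .env comment above are actually set. The following is a minimal sketch, assuming .env uses plain KEY=VALUE lines; the helper names `parse_env` and `missing_keys` are hypothetical and not part of the mcp-atlas repository.

```python
REQUIRED_KEYS = ("LLM_API_KEY", "EVAL_LLM_API_KEY")


def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and one layer of optional quotes.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(text: str, required=REQUIRED_KEYS) -> list:
    """Return the required keys that are absent or set to an empty value."""
    env = parse_env(text)
    return [key for key in required if not env.get(key)]
```

One possible use is `missing_keys(open(".env").read())`, which returns an empty list when both keys are filled in. Optional settings such as LLM_BASE_URL are deliberately not checked.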

2. Run evaluation

shell
# Terminal 1: start the MCP server environment (port 1984). Alternatively build from source: make build && make run-docker
make run-docker
shell
# Wait for log: 'Uvicorn running on http://0.0.0.0:1984', then verify servers are online (expect total:20,online:20):
curl -s http://localhost:1984/enabled-servers | jq -c
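The readiness check can also be scripted rather than read by eye. This sketch assumes the /enabled-servers response is a JSON object with integer `total` and `online` fields, as the expected `total:20,online:20` output suggests; the exact response shape is not confirmed here, and `servers_ready` is a hypothetical helper.

```python
import json


def servers_ready(payload: str, expected_total: int = 20) -> bool:
    """Return True when every expected server is reported online.

    Assumes the payload is a JSON object with "total" and "online" keys.
    """
    status = json.loads(payload)
    total = status.get("total")
    online = status.get("online")
    return total == expected_total and online == total
```

Pass it the text returned by the curl command, and raise `expected_total` if you have enabled additional servers with API keys.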
shell
# Terminal 2: start the completion service (port 3000), which manages the agentic LLM<->tools loop
make run-mcp-completion
shell
# Terminal 3: generate completions over the full public dataset (run from services/mcp_eval)
cd services/mcp_eval
shell
uv run python mcp_completion_script.py --model "openai/gpt-5.1" --input_huggingface "ScaleAI/MCP-Atlas" --output "mcp_eval_51_results.csv"
shell
# (Quick smoke test instead: uv run python mcp_completion_script.py --model "openai/gpt-5.1" --input "sample_tasks.csv" --output "sample_51_results.csv")

3. Score output

shell
# Run from services/mcp_eval; scores trajectories with the LLM judge and computes pass rate
uv run mcp_evals_scores.py --input-file="completion_results/mcp_eval_51_results.csv" --model-label="gpt51"

4. Expected output

The completion run writes a results CSV (e.g. mcp_eval_51_results.csv) under completion_results/ containing both ground truth and model trajectory/completion data. Scoring writes to evaluation_results/: scored_{label}.csv (per-task coverage scores), coverage_stats_{label}.csv (summary statistics, including the headline pass rate = fraction of tasks with coverage_score >= --pass-threshold, default 0.75), and coverage_histogram_{label}.png (score distribution). The pass rate is benchmark-local: do not compare it against tool-use scores from other suites, and note it shifts with how many MCP servers were online and the chosen pass threshold.
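The pass-rate definition above (fraction of tasks with coverage score at or above the threshold) can be reproduced independently as a sanity check. This is a sketch, not the harness's own code: it assumes the scored CSV has a column named `coverage_score`, and the `task_id` column and sample rows below are invented purely for illustration.

```python
import csv
import io


def pass_rate(scores, threshold=0.75):
    """Fraction of tasks whose coverage score is >= threshold."""
    scores = list(scores)
    if not scores:
        raise ValueError("no scores to aggregate")
    return sum(score >= threshold for score in scores) / len(scores)


def scores_from_csv(handle, column="coverage_score"):
    """Read one float per row from the assumed coverage-score column."""
    return [float(row[column]) for row in csv.DictReader(handle)]


# Illustrative data only; not real benchmark output.
sample = io.StringIO("task_id,coverage_score\na,1.0\nb,0.75\nc,0.5\nd,0.0\n")
print(pass_rate(scores_from_csv(sample)))  # 0.5 (two of four rows reach 0.75)
```

Because the comparison is inclusive, a task scoring exactly 0.75 counts as a pass under the default threshold.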

5. Submit results

Scores are computed locally; the public leaderboard at https://scale.com/leaderboard/mcp_atlas is maintained by Scale AI (no documented self-serve submission). When reporting a number, attach: model id and provider/base URL, the Docker image tag (e.g. ghcr.io/scaleapi/mcp-atlas:1.2.5) or harness commit, which MCP servers were online (run curl http://localhost:1984/enabled-servers; with no API keys only the 20 default servers run, covering ~18% of tasks), the evaluator/judge model (default gemini/gemini-2.5-pro), --pass-threshold used, --concurrency, and that the run used the public 500-sample subset of ScaleAI/MCP-Atlas (not the full internal benchmark).
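One way to keep this metadata attached to a score is to write it as a small JSON record next to the results. The sketch below is an illustration only: the `RunRecord` class and its field names are hypothetical, with defaults taken from the values stated in this guide, and the pass rate shown is a placeholder rather than a real result.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class RunRecord:
    """Context needed to interpret one MCP Atlas pass rate."""
    model_id: str
    pass_rate: float
    servers_online: int
    judge_model: str = "gemini/gemini-2.5-pro"
    pass_threshold: float = 0.75
    concurrency: Optional[int] = None
    base_url: Optional[str] = None
    image_tag: str = "ghcr.io/scaleapi/mcp-atlas:1.2.5"
    dataset: str = "ScaleAI/MCP-Atlas (public 500-sample subset)"


def to_json(record: RunRecord) -> str:
    """Serialize the record with stable key order for easy diffing."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


# pass_rate=0.0 is a placeholder, not a measured score.
record = RunRecord(model_id="openai/gpt-5.1", pass_rate=0.0, servers_online=20)
print(to_json(record))
```

Storing the record alongside coverage_stats_{label}.csv makes it harder for a number to circulate without its server set and threshold.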

Gotchas

Coverage depends on API keys: only ~18% of the 500 tasks are runnable with the 20 default no-key servers. Adding keys for exa/airtable/mongodb/oxylabs/brave-search etc. materially changes the achievable pass rate, so a score is only comparable against runs with the same server set online. Five servers (Airtable, Google Calendar/google-workspace, Notion, MongoDB, Slack) also need sample-data uploads per data_exports/README.md or tasks return erroneous results.
The HF dataset is a public 500-sample subset, not the full MCP Atlas benchmark, so locally computed pass rates are not directly comparable to the official leaderboard numbers.
Two long-running services must be up before evaluating: the Docker MCP environment on port 1984 (takes 1+ min; wait for the Uvicorn message and confirm total:20,online:20 via /enabled-servers) and the completion service on port 3000 (make run-mcp-completion). Docker needs >=8GB RAM. If you change LLM_API_KEY, restart make run-mcp-completion; after adding server keys, restart the Docker container and re-run uv run test_servers.py.
Flag names are exact and easy to mix up: local file input is --input, HF dataset input is --input_huggingface (underscore, not dash), while scoring uses --input-file and --model-label (dashes). The scoring script is mcp_evals_scores.py (plural 'evals'). Completion and scoring (steps 5-8 in the upstream README's numbering, not this guide's) are run from services/mcp_eval. Completion auto-skips tasks already present in the output CSV, so delete or rename that file to force a full re-run.
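To avoid mistyping the flags, the two commands can be assembled programmatically. The sketch below only builds argument lists using the flag names quoted in this guide; the functions `completion_cmd`, `scoring_cmd`, and `set_aside` are hypothetical helpers, and nothing is executed until you pass a list to something like `subprocess.run` from services/mcp_eval.

```python
from pathlib import Path


def completion_cmd(model, output, hf_dataset=None, local_csv=None):
    """Build the completion command; exactly one input source is required."""
    if (hf_dataset is None) == (local_csv is None):
        raise ValueError("pass exactly one of hf_dataset or local_csv")
    cmd = ["uv", "run", "python", "mcp_completion_script.py", "--model", model]
    if hf_dataset is not None:
        cmd += ["--input_huggingface", hf_dataset]  # underscore, not dash
    else:
        cmd += ["--input", local_csv]
    return cmd + ["--output", output]


def scoring_cmd(input_file, model_label):
    """Build the scoring command; these flags use dashes."""
    return [
        "uv", "run", "mcp_evals_scores.py",
        f"--input-file={input_file}",
        f"--model-label={model_label}",
    ]


def set_aside(path):
    """Rename an existing output CSV so completion does not skip its tasks.

    Returns the backup path, or None if there was nothing to rename.
    An existing .bak file may be overwritten on POSIX systems.
    """
    path = Path(path)
    if not path.exists():
        return None
    backup = path.with_name(path.name + ".bak")
    path.rename(backup)
    return backup
```

For example, `completion_cmd("openai/gpt-5.1", "mcp_eval_51_results.csv", hf_dataset="ScaleAI/MCP-Atlas")` reproduces the full-dataset command from step 2 as a list of arguments.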