MCP-Universe
A benchmark from Salesforce AI Research that evaluates LLMs and agents on real-world Model Context Protocol (MCP) server tasks across six domains (location navigation, repository management, financial analysis, 3D design, browser automation, web searching), measuring end-to-end task success rate.
Category: Tool use. Metric: Overall Success Rate (higher is better).
| Model | Lab | Score | Source model | Status | Date |
|---|---|---|---|---|---|
| Gemini 3 Pro | Google DeepMind | 44.59% | — | Verified | Nov 18, 2025 |
| GPT-5 | OpenAI | 44.16% | — | Verified | Aug 7, 2025 |
| Grok 4.1 fast reasoning | xAI | 40.69% | — | Verified | Nov 19, 2025 |
| Claude Sonnet 4.5 | Anthropic | 35.06% | — | Verified | Sep 29, 2025 |
| Grok 4 | xAI | 33.33% | — | Verified | Jul 9, 2025 |
| Claude Sonnet 4 | Anthropic | 32.90% | — | Verified | May 22, 2025 |
| Claude Opus 4.1 | Anthropic | 29.44% | — | Verified | Aug 5, 2025 |
| Claude Opus 4 | Anthropic | 28.14% | — | Verified | May 22, 2025 |
| o3 | OpenAI | 26.41% | — | Verified | Apr 16, 2025 |
| Claude Haiku 4.5 | Anthropic | 26.41% | — | Verified | Oct 15, 2025 |
| Kimi K2 Thinking | Moonshot AI | 26.41% | — | Verified | Nov 6, 2025 |
| o4-mini | OpenAI | 25.97% | — | Verified | Apr 16, 2025 |
| GLM-4.6 | Z.ai | 25.97% | — | Verified | Sep 30, 2025 |
| GPT-OSS-120B | OpenAI | 25.54% | — | Verified | Aug 5, 2025 |
| Claude 3.7 Sonnet | Anthropic | 24.24% | — | Verified | Feb 24, 2025 |
| Qwen 3 Coder 480B | Alibaba / Qwen | 22.94% | — | Verified | Jul 22, 2025 |
| Gemini 2.5 Pro | Google DeepMind | 22.08% | — | Verified | Mar 25, 2025 |
| DeepSeek V3.1 | DeepSeek | 22.08% | — | Verified | Aug 21, 2025 |
| Gemini 2.5 Flash | Google DeepMind | 21.65% | — | Verified | Apr 17, 2025 |
| DeepSeek V3.2 | DeepSeek | 19.91% | — | Verified | Dec 1, 2025 |
| GPT-4.1 | OpenAI | 19.91% | — | Verified | Apr 14, 2025 |
| Qwen3 235B A22B Instruct 2507 | Alibaba / Qwen | 18.18% | — | Verified | Jul 21, 2025 |
| GPT-4o | OpenAI | 15.58% | — | Verified | May 13, 2024 |
| DeepSeek V3 | DeepSeek | 14.29% | — | Verified | Dec 26, 2024 |
Each row reports the model’s Overall Success Rate on MCP-Universe.
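As a rough illustration of what an end-to-end success rate means, the sketch below computes the percentage of tasks that passed from a list of per-task outcomes. It is not the benchmark's own evaluation harness: the `TaskResult` type, the `overall_success_rate` function, the equal weighting of tasks, and the sample outcomes are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical record of one benchmark task (not the benchmark's own schema)."""
    domain: str   # e.g. "financial analysis"
    passed: bool  # True if the task was completed end to end


def overall_success_rate(results: list[TaskResult]) -> float:
    """Return the percentage of tasks that passed, weighting every task equally (an assumption)."""
    if not results:
        raise ValueError("no task results to score")
    passed = sum(1 for r in results if r.passed)
    return 100.0 * passed / len(results)


# Made-up outcomes, for illustration only.
sample = [
    TaskResult("location navigation", True),
    TaskResult("repository management", False),
    TaskResult("financial analysis", True),
    TaskResult("browser automation", True),
]
print(f"{overall_success_rate(sample):.2f}%")  # prints 75.00%
```

Under this equal-weighting assumption, a task counts only if it succeeds end to end, so partial progress contributes nothing to the score.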