evals.report

MCP-Universe

A benchmark from Salesforce AI Research that evaluates LLMs and agents on real-world Model Context Protocol (MCP) server tasks across six domains (location navigation, repository management, financial analysis, 3D design, browser automation, web searching), measuring end-to-end task success rate.
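As a rough illustration of what an end-to-end task success rate means, the sketch below computes an overall percentage and a per-domain breakdown from a list of pass/fail task outcomes. It assumes the overall score is simply passed tasks divided by total tasks; the `TaskResult` type, the function names, and the sample data are hypothetical and are not taken from the MCP-Universe code base, which may aggregate results differently.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TaskResult:
    """One evaluated task (hypothetical structure, for illustration only)."""
    domain: str   # e.g. "web searching" or "3D design"
    passed: bool  # True if the task was completed end to end


def overall_success_rate(results):
    """Return the percentage of tasks that passed (assumed: passed / total)."""
    if not results:
        raise ValueError("no task results")
    passed = sum(1 for r in results if r.passed)
    return 100.0 * passed / len(results)


def success_rate_by_domain(results):
    """Return a dict mapping each domain to its own success percentage."""
    grouped = defaultdict(list)
    for r in results:
        grouped[r.domain].append(r)
    return {domain: overall_success_rate(rs) for domain, rs in grouped.items()}


if __name__ == "__main__":
    # Made-up outcomes, not real benchmark data.
    sample = [
        TaskResult("web searching", True),
        TaskResult("web searching", False),
        TaskResult("3D design", False),
        TaskResult("financial analysis", False),
    ]
    print(f"Overall: {overall_success_rate(sample):.2f}%")
    for domain, rate in success_rate_by_domain(sample).items():
        print(f"{domain}: {rate:.2f}%")
```

Under this assumption every task counts equally, so a domain with more tasks has more influence on the overall figure than a domain with fewer.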

Category: Tool use. Metric: Overall Success Rate (higher is better).
| Model | Lab | Score | Source model | Status | Date |
| --- | --- | --- | --- | --- | --- |
| Gemini 3 Pro | Google DeepMind | 44.59% | | Verified | Nov 18, 2025 |
| GPT-5 | OpenAI | 44.16% | | Verified | Aug 7, 2025 |
| Grok 4.1 fast reasoning | xAI | 40.69% | | Verified | Nov 19, 2025 |
| Claude Sonnet 4.5 | Anthropic | 35.06% | | Verified | Sep 29, 2025 |
| Grok 4 | xAI | 33.33% | | Verified | Jul 9, 2025 |
| Claude Sonnet 4 | Anthropic | 32.90% | | Verified | May 22, 2025 |
| Claude Opus 4.1 | Anthropic | 29.44% | | Verified | Aug 5, 2025 |
| Claude Opus 4 | Anthropic | 28.14% | | Verified | May 22, 2025 |
| o3 | OpenAI | 26.41% | | Verified | Apr 16, 2025 |
| Claude Haiku 4.5 | Anthropic | 26.41% | | Verified | Oct 15, 2025 |
| Kimi K2 Thinking | Moonshot AI | 26.41% | | Verified | Nov 6, 2025 |
| o4-mini | OpenAI | 25.97% | | Verified | Apr 16, 2025 |
| GLM-4.6 | Z.ai | 25.97% | | Verified | Sep 30, 2025 |
| GPT-OSS-120B | OpenAI | 25.54% | | Verified | Aug 5, 2025 |
| Claude 3.7 Sonnet | Anthropic | 24.24% | | Verified | Feb 24, 2025 |
| Qwen 3 Coder 480B | Alibaba / Qwen | 22.94% | | Verified | Jul 22, 2025 |
| Gemini 2.5 Pro | Google DeepMind | 22.08% | | Verified | Mar 25, 2025 |
| DeepSeek V3.1 | DeepSeek | 22.08% | | Verified | Aug 21, 2025 |
| Gemini 2.5 Flash | Google DeepMind | 21.65% | | Verified | Apr 17, 2025 |
| DeepSeek V3.2 | DeepSeek | 19.91% | | Verified | Dec 1, 2025 |
| GPT-4.1 | OpenAI | 19.91% | | Verified | Apr 14, 2025 |
| Qwen3 235B A22B Instruct 2507 | Alibaba / Qwen | 18.18% | | Verified | Jul 21, 2025 |
| GPT-4o | OpenAI | 15.58% | | Verified | May 13, 2024 |
| DeepSeek V3 | DeepSeek | 14.29% | | Verified | Dec 26, 2024 |

Each row reports the model’s Overall Success Rate on MCP-Universe.