evals.report

ScreenSpot-Pro

A GUI grounding benchmark that measures how accurately a multimodal model can locate a referenced UI element and return its position, given a natural-language instruction and a full-screen, high-resolution screenshot of professional desktop software. It covers 23 applications across 5 industries and 3 operating systems.
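To make the accuracy metric concrete, here is a minimal sketch of how a grounding score of this kind could be computed. It assumes the common point-in-box convention: a prediction counts as correct when the predicted click point falls inside the target element's bounding box. The page itself does not spell out the scoring rule, so the `Box` type, the `grounding_accuracy` function, and the sample coordinates are all hypothetical and for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical ground-truth bounding box in screenshot pixel coordinates."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        # Inclusive bounds: a point on the box edge counts as a hit.
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def grounding_accuracy(predictions, targets):
    """Fraction of predicted (x, y) points that land inside their target box.

    Assumption: one prediction per target, in the same order.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(1 for (x, y), box in zip(predictions, targets) if box.contains(x, y))
    return hits / len(targets)


# Two made-up examples on a high-resolution screenshot: one hit, one miss.
targets = [Box(100, 200, 140, 230), Box(3000, 50, 3040, 80)]
predictions = [(120, 215), (2900, 60)]
print(f"{grounding_accuracy(predictions, targets):.1%}")
```

Because targets in professional software are often small relative to a full-screen, high-resolution screenshot, a point-in-box rule like this is strict: a prediction a few pixels outside the element scores zero for that example.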

Category: Multimodal. Metric: accuracy (higher is better).
Model                   Lab              Score  Status      Date
Claude Opus 4.8         Anthropic        87.9%  Unverified  May 28, 2026
GPT-5.2                 OpenAI           86.3%  Unverified  Dec 11, 2025
GPT-5.4                 OpenAI           85.4%  Unverified  Mar 5, 2026
Gemini 3.1 Pro Preview  Google DeepMind  84.4%  Unverified  Feb 19, 2026
Muse Spark              Meta             84.1%  Unverified  Apr 8, 2026
Claude Opus 4.6         Anthropic        83.1%  Unverified  Feb 5, 2026
Gemini 3 Pro            Google DeepMind  72.7%  Verified    Nov 18, 2025
Gemini 3 Flash          Google DeepMind  69.1%  Unverified  Dec 17, 2025
Qwen3.5-397B-A17B       Alibaba / Qwen   65.6%  Unverified  Feb 16, 2026
Claude Opus 4.5         Anthropic        45.7%  Unverified  Nov 24, 2025
Claude Sonnet 4.5       Anthropic        36.2%  Verified    Sep 29, 2025
Gemini 2.5 Pro          Google DeepMind  11.4%  Verified    Mar 25, 2025
GPT-5.1                 OpenAI           3.5%   Verified    Nov 12, 2025

Each row reports the model's accuracy on ScreenSpot-Pro.