ScreenSpot-Pro
A GUI grounding benchmark that measures how accurately a multimodal model can locate a referenced UI element. Given a natural-language instruction and a full-screen, high-resolution screenshot of professional desktop software, the model must return the element's position. The benchmark covers 23 applications across 5 industries and 3 operating systems.
Category: Multimodal. Metric: accuracy (higher is better).
| Model | Lab | Score (accuracy) | Source model | Status | Date |
|---|---|---|---|---|---|
| Claude Opus 4.8 | Anthropic | 87.9% | — | Unverified | May 28, 2026 |
| GPT-5.2 | OpenAI | 86.3% | — | Unverified | Dec 11, 2025 |
| GPT-5.4 | OpenAI | 85.4% | — | Unverified | Mar 5, 2026 |
| Gemini 3.1 Pro Preview | Google DeepMind | 84.4% | — | Unverified | Feb 19, 2026 |
| Muse Spark | Meta | 84.1% | — | Unverified | Apr 8, 2026 |
| Claude Opus 4.6 | Anthropic | 83.1% | — | Unverified | Feb 5, 2026 |
| Gemini 3 Pro | Google DeepMind | 72.7% | — | Verified | Nov 18, 2025 |
| Gemini 3 Flash | Google DeepMind | 69.1% | — | Unverified | Dec 17, 2025 |
| Qwen3.5-397B-A17B | Alibaba / Qwen | 65.6% | — | Unverified | Feb 16, 2026 |
| Claude Opus 4.5 | Anthropic | 45.7% | — | Unverified | Nov 24, 2025 |
| Claude Sonnet 4.5 | Anthropic | 36.2% | — | Verified | Sep 29, 2025 |
| Gemini 2.5 Pro | Google DeepMind | 11.4% | — | Verified | Mar 25, 2025 |
| GPT-5.1 | OpenAI | 3.5% | — | Verified | Nov 12, 2025 |
Each row reports the model’s accuracy on ScreenSpot-Pro.
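To make the accuracy metric concrete, here is a minimal sketch of how a grounding score of this kind could be computed. It assumes that a prediction counts as correct when the predicted point falls inside the target element's bounding box; this page only says that the model returns the element's position, so the scoring rule is an assumption. The names `Box` and `grounding_accuracy` are hypothetical and do not come from any official evaluation harness.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of a UI element, in screenshot pixels (hypothetical type)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        # Edges are treated as inside the element (an assumption).
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def grounding_accuracy(
    predictions: Iterable[Tuple[float, float]],
    targets: Iterable[Box],
) -> float:
    """Fraction of predicted (x, y) points that land inside their target box."""
    preds = list(predictions)
    boxes = list(targets)
    if len(preds) != len(boxes):
        raise ValueError("predictions and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(box.contains(x, y) for (x, y), box in zip(preds, boxes))
    return hits / len(preds)


if __name__ == "__main__":
    # Toy data for illustration only, not taken from the benchmark.
    boxes = [Box(0, 0, 10, 10), Box(100, 100, 120, 110), Box(50, 50, 60, 60)]
    points = [(5, 5), (110, 105), (70, 70)]
    print(f"accuracy = {grounding_accuracy(points, boxes):.1%}")
```

Under this assumed rule, a model that returns a bounding box rather than a point would first need to be reduced to a single point, for example the box center, before scoring.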