ScreenSpot-Pro
A GUI grounding benchmark that measures how accurately a multimodal model can locate a referenced UI element, that is, return its on-screen position, given a natural-language instruction and a full-screen, high-resolution screenshot of professional desktop software. It covers 23 applications across 5 industries and 3 operating systems.
Multimodal · accuracy · Higher is better
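As a rough illustration of how a grounding accuracy score can be computed, the sketch below assumes the common point-in-box rule: a prediction counts as correct when the predicted (x, y) point falls inside the target element's bounding box. That rule, the pixel coordinate convention, and the names `Box` and `grounding_accuracy` are assumptions made for this example, not details confirmed by this page.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Box:
    """Hypothetical target bounding box in screenshot pixel coordinates."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        # Edges are treated as inside the box (an assumption of this sketch).
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def grounding_accuracy(
    predictions: Sequence[Tuple[float, float]],
    targets: Sequence[Box],
) -> float:
    """Fraction of predicted points that land inside their target box."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        return 0.0
    hits = sum(
        1 for (x, y), box in zip(predictions, targets) if box.contains(x, y)
    )
    return hits / len(predictions)


# Illustrative data only: two of the three predicted points fall in their box.
example_predictions = [(10, 10), (50, 50), (200, 200)]
example_targets = [
    Box(0, 0, 20, 20),
    Box(60, 60, 80, 80),
    Box(150, 150, 250, 250),
]
example_score = grounding_accuracy(example_predictions, example_targets)
```

With this scoring rule, small targets on a high-resolution screenshot leave little margin for error, since a prediction a few pixels outside the box scores the same as one on the other side of the screen.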
No run guide for this benchmark yet.