OSWorld
OSWorld benchmarks multimodal AI agents on open-ended, real-world computer-use tasks in live operating-system environments. Agents operate GUIs across web, file, and application workflows, observing the screen through screenshots and acting through mouse and keyboard control. Performance is measured by execution-based task success rate.
Agents · task success rate · Higher is better
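As a rough illustration of the metric, task success rate can be sketched as the fraction of tasks whose final environment state passes an execution-based check. The `TaskResult` type, the task identifiers, and the `success_rate` helper below are hypothetical names chosen for this sketch; they are not part of OSWorld's actual harness.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one task attempt (hypothetical structure for illustration)."""
    task_id: str   # made-up identifier, not a real OSWorld task name
    passed: bool   # True if an execution-based check accepted the final state


def success_rate(results: list[TaskResult]) -> float:
    """Return the fraction of tasks that passed; higher is better."""
    if not results:
        raise ValueError("no task results to score")
    return sum(r.passed for r in results) / len(results)


# Four made-up attempts, two of which passed their checks.
results = [
    TaskResult("rename-file", True),
    TaskResult("edit-spreadsheet", False),
    TaskResult("change-browser-setting", True),
    TaskResult("export-document", False),
]

rate = success_rate(results)
print(f"task success rate: {rate:.1%}")  # prints "task success rate: 50.0%"
```

Each task counts equally in this sketch, so the score is a plain mean of pass/fail outcomes; any weighting or partial credit would be a different design than the one shown here.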
No run guide for this benchmark yet.