OSWorld
OSWorld benchmarks multimodal AI agents on open-ended, real-world computer-use tasks in live operating-system environments. Agents operate GUIs across the web, files, and applications, taking screenshots as input and acting through mouse and keyboard control. Performance is measured by execution-based task success rate.
Agents · task success rate · Higher is better
| Model | Lab | Score (task success rate) | Source model | Status | Date |
|---|---|---|---|---|---|
| Claude Opus 4.8 | Anthropic | 83.4% | — | Verified | May 28, 2026 |
| Claude Opus 4.7 | Anthropic | 82.8% | — | Verified | Apr 16, 2026 |
| Claude Mythos Preview | Anthropic | 79.6% | — | Unverified | Apr 7, 2026 |
| GPT-5.5 | OpenAI | 78.7% | — | Unverified | Apr 23, 2026 |
| Gemini 3.5 Flash | Google DeepMind | 78.4% | — | Unverified | May 19, 2026 |
| Gemini 3.1 Pro Preview | Google DeepMind | 76.2% | — | Verified | Feb 19, 2026 |
| GPT-5.4 | OpenAI | 75.0% | — | Unverified | Mar 5, 2026 |
| Kimi K2.6 | Moonshot AI | 73.1% | — | Unverified | Apr 20, 2026 |
| Claude Opus 4.6 | Anthropic | 72.7% | — | Unverified | Feb 5, 2026 |
| Claude Sonnet 4.6 | Anthropic | 72.1% | — | Unverified | Feb 17, 2026 |
| MiniMax M3 | MiniMax | 70.1% | — | Unverified | Jun 1, 2026 |
| Claude Opus 4.5 | Anthropic | 66.3% | — | Unverified | Nov 24, 2025 |
| GPT-5.3-Codex | OpenAI | 64.7% | — | Unverified | Feb 5, 2026 |
| Claude Sonnet 4.5 | Anthropic | 61.4% | — | Unverified | Sep 29, 2025 |
| GPT-5.2 | OpenAI | 47.3% | — | Unverified | Dec 11, 2025 |
Each row reports the model’s task success rate on OSWorld.
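As a rough illustration of the metric, task success rate can be read as the percentage of tasks whose execution-based check passes. The sketch below assumes each task yields a single pass/fail outcome; the `TaskResult` type, the task names, and the `success_rate` helper are hypothetical and are not part of the OSWorld harness.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str   # hypothetical task identifier
    passed: bool   # assumed outcome of the execution-based check


def success_rate(results: list[TaskResult]) -> float:
    """Return the percentage of tasks that passed."""
    if not results:
        raise ValueError("no task results to score")
    passed = sum(1 for r in results if r.passed)
    return 100.0 * passed / len(results)


# Made-up outcomes for four illustrative tasks.
results = [
    TaskResult("open-file", True),
    TaskResult("rename-folder", False),
    TaskResult("send-email", True),
    TaskResult("edit-sheet", True),
]
print(f"{success_rate(results):.1f}%")  # prints 75.0%
```

Under this reading, a score such as 83.4% means that roughly 83 of every 100 tasks ended in a state that passed its check.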