evals.report

OSWorld

OSWorld benchmarks multimodal AI agents on open-ended, real-world computer-use tasks in live operating-system environments. Agents operate GUIs across web, files, and applications through screenshots and mouse/keyboard control, and are scored by execution-based task success rate.
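As a rough illustration of the metric, the sketch below computes a task success rate from per-task outcomes. It assumes each task yields a binary pass/fail from a check on the environment's final state. The `TaskResult` record, the `success_rate` helper, and the task names are hypothetical and are not part of OSWorld's tooling.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one task attempt (hypothetical record, not an OSWorld type)."""
    task_id: str
    passed: bool  # True if the post-run check on the environment state succeeded


def success_rate(results: list[TaskResult]) -> float:
    """Return the percentage of tasks whose execution-based check passed."""
    if not results:
        raise ValueError("no task results to score")
    passed = sum(1 for r in results if r.passed)
    return 100.0 * passed / len(results)


# Illustrative outcomes only; these are not real benchmark results.
results = [
    TaskResult("rename-file", True),
    TaskResult("fill-web-form", False),
    TaskResult("edit-spreadsheet", True),
    TaskResult("change-setting", True),
]
print(f"{success_rate(results):.1f}%")  # 75.0%
```

Under this assumption, a reported score such as 75.0% would mean that three out of every four tasks ended in a state that passed its check.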

Agents · task success rate · Higher is better
| Model | Lab | Score | Status | Date |
|---|---|---|---|---|
| Claude Opus 4.8 | Anthropic | 83.4% | Verified | May 28, 2026 |
| Claude Opus 4.7 | Anthropic | 82.8% | Verified | Apr 16, 2026 |
| Claude Mythos Preview | Anthropic | 79.6% | Unverified | Apr 7, 2026 |
| GPT-5.5 | OpenAI | 78.7% | Unverified | Apr 23, 2026 |
| Gemini 3.5 Flash | Google DeepMind | 78.4% | Unverified | May 19, 2026 |
| Gemini 3.1 Pro Preview | Google DeepMind | 76.2% | Verified | Feb 19, 2026 |
| GPT-5.4 | OpenAI | 75.0% | Unverified | Mar 5, 2026 |
| Kimi K2.6 | Moonshot AI | 73.1% | Unverified | Apr 20, 2026 |
| Claude Opus 4.6 | Anthropic | 72.7% | Unverified | Feb 5, 2026 |
| Claude Sonnet 4.6 | Anthropic | 72.1% | Unverified | Feb 17, 2026 |
| MiniMax M3 | MiniMax | 70.1% | Unverified | Jun 1, 2026 |
| Claude Opus 4.5 | Anthropic | 66.3% | Unverified | Nov 24, 2025 |
| GPT-5.3-Codex | OpenAI | 64.7% | Unverified | Feb 5, 2026 |
| Claude Sonnet 4.5 | Anthropic | 61.4% | Unverified | Sep 29, 2025 |
| GPT-5.2 | OpenAI | 47.3% | Unverified | Dec 11, 2025 |

Each row reports the model’s task success rate on OSWorld.