Online-Mind2Web
A live web-agent benchmark of 300 realistic tasks across 136 real websites that measures whether an autonomous agent can complete end-to-end web tasks on dynamic, online pages, scored as task success rate.
Category: Agents | Metric: Task success rate | Higher is better
| Model | Lab | Score | Source model | Status | Date |
|---|---|---|---|---|---|
| GPT-5.4 | OpenAI | 92.8% | — | Verified | Mar 5, 2026 |
| GPT-5 | OpenAI | 42.33% | — | Verified | Aug 7, 2025 |
| Claude Sonnet 4 | Anthropic | 40.00% | — | Verified | May 22, 2025 |
| Claude 3.7 Sonnet | Anthropic | 39.33% | — | Verified | Feb 24, 2025 |
| o3 | OpenAI | 39.00% | — | Verified | Apr 16, 2025 |
| GPT-4.1 | OpenAI | 36.33% | — | Verified | Apr 14, 2025 |
| DeepSeek V3 | DeepSeek | 32.33% | — | Verified | Dec 26, 2024 |
| o4-mini | OpenAI | 32.00% | — | Verified | Apr 16, 2025 |
| GPT-4o | OpenAI | 30.7% | — | Official | May 13, 2024 |
| Gemini 2.0 Flash | Google DeepMind | 29.00% | — | Verified | Dec 11, 2024 |
| Claude 3.5 Sonnet | Anthropic | 29.0% | — | Official | Jun 20, 2024 |
| DeepSeek R1 | DeepSeek | 25.33% | — | Verified | Jan 20, 2025 |
Each row reports the model's task success rate on Online-Mind2Web, sorted from highest to lowest score.
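As a rough illustration of how a task success rate is derived, the sketch below computes the percentage of tasks judged successful from a list of per-task outcomes. The function name `task_success_rate` and the outcome list are hypothetical and only there for illustration. This is not the benchmark's own evaluation code, which also has to decide whether each task counts as a success.

```python
from typing import Sequence


def task_success_rate(outcomes: Sequence[bool]) -> float:
    """Return the percentage of tasks marked successful.

    `outcomes` holds one boolean per task (True = task completed).
    This is an illustrative helper, not part of any official harness.
    """
    if not outcomes:
        raise ValueError("outcomes must contain at least one task")
    # sum() counts the True entries because bool is a subclass of int.
    return 100.0 * sum(outcomes) / len(outcomes)


# Hypothetical run over 300 tasks with 127 judged successful.
outcomes = [True] * 127 + [False] * 173
print(f"{task_success_rate(outcomes):.2f}%")  # prints "42.33%"
```

With 300 tasks, each additional completed task moves the score by about 0.33 percentage points, which is why many entries above end in .00, .33, or .67 when rounded.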