WebArena
A reproducible, self-hostable web environment of fully functional sites (e-commerce, content management, social forum, and software development) in which autonomous agents attempt 812 realistic, long-horizon, multi-step web tasks. Agents are scored by success rate: the share of tasks completed, with completion judged on functional correctness.
Category: Agents · Metric: Task success rate · Higher is better
| Model | Lab | Score | Source model | Status | Date |
|---|---|---|---|---|---|
| Claude Opus 4.6 | Anthropic | 68.0% | — | Verified | Feb 5, 2026 |
| Claude Sonnet 4.6 | Anthropic | 65.6% | — | Verified | Feb 17, 2026 |
| Claude Opus 4.5 | Anthropic | 65.3% | — | Verified | Nov 24, 2025 |
| Claude Sonnet 4.5 | Anthropic | 58.5% | — | Verified | Sep 29, 2025 |
| Gemini 2.5 Pro | Google DeepMind | 54.8% | — | Verified | Mar 25, 2025 |
| Claude Haiku 4.5 | Anthropic | 53.1% | — | Verified | Oct 15, 2025 |
| Claude 3.7 Sonnet | Anthropic | 52.0% | — | Verified | Feb 24, 2025 |
| GPT-4o | OpenAI | 42.8% | — | Verified | May 13, 2024 |
Each row reports the model’s task success rate on WebArena.
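As a rough illustration of the metric, the sketch below computes a task success rate from per-task pass/fail outcomes. It assumes each task has already been reduced to a single boolean by a functional-correctness check. The function name `task_success_rate` and the example outcomes are hypothetical and are not taken from the benchmark's own tooling or from any run in the table.

```python
from typing import Iterable


def task_success_rate(outcomes: Iterable[bool]) -> float:
    """Return the percentage of tasks whose correctness check passed.

    `outcomes` holds one boolean per task (True = task completed correctly).
    This is an illustrative helper, not part of any benchmark harness.
    """
    results = list(outcomes)
    if not results:
        raise ValueError("no task outcomes given")
    # True counts as 1 and False as 0, so sum() is the number of passes.
    return 100.0 * sum(results) / len(results)


# Made-up example: 406 of 812 tasks pass.
outcomes = [True] * 406 + [False] * 406
rate = task_success_rate(outcomes)
print(f"{rate:.1f}%")  # prints "50.0%"
```

Under this definition every task carries equal weight, so a long multi-step task counts the same as a short one and partial progress earns no credit.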