evals.report

WebArena

A reproducible, self-hostable web environment of fully functional sites (e-commerce, content management, social forum, and software development). Autonomous agents are scored by success rate, judged on functional correctness, across 812 realistic, long-horizon, multi-step web tasks.
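As a rough illustration of the metric, the sketch below computes a success rate as the share of tasks whose functional-correctness check passed. The function name `success_rate` and the example outcomes are hypothetical and are not taken from the benchmark's own harness or from any leaderboard entry; only the task count of 812 comes from the description above.

```python
from typing import Iterable

TOTAL_TASKS = 812  # task count stated in the benchmark description


def success_rate(outcomes: Iterable[bool]) -> float:
    """Return the percentage of tasks whose functional check passed.

    `outcomes` holds one boolean per task: True if the agent's final
    state satisfied the task's correctness check, False otherwise.
    """
    results = list(outcomes)
    if not results:
        raise ValueError("no task outcomes supplied")
    return 100.0 * sum(results) / len(results)


# Hypothetical run (not a real result): 406 of the 812 tasks pass.
example_outcomes = [True] * 406 + [False] * (TOTAL_TASKS - 406)
print(f"{success_rate(example_outcomes):.1f}%")
```

Under this reading, every task carries equal weight and there is no partial credit, so a one-point change in the score corresponds to roughly eight tasks.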

Agents · Task success rate · Higher is better
| Model | Lab | Score | Source model | Status | Date |
|---|---|---|---|---|---|
| Claude Opus 4.6 | Anthropic | 68.0% | | Verified | Feb 5, 2026 |
| Claude Sonnet 4.6 | Anthropic | 65.6% | | Verified | Feb 17, 2026 |
| Claude Opus 4.5 | Anthropic | 65.3% | | Verified | Nov 24, 2025 |
| Claude Sonnet 4.5 | Anthropic | 58.5% | | Verified | Sep 29, 2025 |
| Gemini 2.5 Pro | Google DeepMind | 54.8% | | Verified | Mar 25, 2025 |
| Claude Haiku 4.5 | Anthropic | 53.1% | | Verified | Oct 15, 2025 |
| Claude 3.7 Sonnet | Anthropic | 52.0% | | Verified | Feb 24, 2025 |
| GPT-4o | OpenAI | 42.8% | | Verified | May 13, 2024 |

Each row reports the model's task success rate on WebArena.