evals.report
WebArena

A reproducible, self-hostable web environment of fully functional sites (e-commerce, content management, social forum, and software development). Autonomous agents are scored by success rate, judged on functional correctness, across 812 realistic, long-horizon, multi-step web tasks.

Agents | Task success rate | Higher is better
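
As a rough illustration of the metric, task success rate is the fraction of tasks whose outcome passes the functional-correctness check. The sketch below assumes each task has already been judged pass or fail; the function name `task_success_rate` and the task identifiers are hypothetical and are not part of any WebArena tooling.

```python
from typing import Mapping


def task_success_rate(results: Mapping[str, bool]) -> float:
    """Return the fraction of tasks marked successful.

    `results` maps a task identifier to True (task passed its
    functional-correctness check) or False (it did not).
    """
    if not results:
        raise ValueError("no task results to score")
    # True counts as 1 and False as 0, so sum() gives the number of passes.
    return sum(results.values()) / len(results)


# Hypothetical outcomes for four tasks (illustrative only, not real results).
example = {"task_0": True, "task_1": False, "task_2": True, "task_3": False}
print(f"{task_success_rate(example):.1%}")  # prints 50.0%
```

On the full benchmark the denominator would be the 812 tasks, so every task counts equally toward the score regardless of which site it runs on.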

No run guide for this benchmark yet.