WebArena
A reproducible, self-hostable web environment of fully functional sites (e-commerce, content management, social forum, and software development). Autonomous agents are scored by task success rate, judged on functional correctness, across 812 realistic, long-horizon, multi-step web tasks.
Agents | Task success rate | Higher is better
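As a rough illustration of the metric, task success rate is the fraction of attempted tasks whose functional-correctness check passed. The sketch below is not WebArena's evaluation harness: `TaskResult`, `success_rate`, and the sample results are hypothetical names and data introduced only to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one task (hypothetical record, not a WebArena type)."""
    task_id: int
    passed: bool  # True if the functional-correctness check succeeded


def success_rate(results):
    """Return the fraction of tasks that passed; higher is better."""
    if not results:
        raise ValueError("no task results to score")
    return sum(r.passed for r in results) / len(results)


# Made-up example run: 3 of 4 tasks pass.
results = [
    TaskResult(0, True),
    TaskResult(1, False),
    TaskResult(2, True),
    TaskResult(3, True),
]
rate = success_rate(results)
print(f"{rate:.1%}")  # prints 75.0%
```

On a full run the denominator would be the 812 tasks, so each task contributes equally to the score regardless of how many steps it takes.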
No run guide for this benchmark yet.