BrowseComp
A benchmark of 1,266 hard-to-find, multi-hop web-browsing questions whose answers are difficult to locate but easy to verify, measuring an agent's ability to persistently search and synthesize information from the web.
Category: Agents. Metric: accuracy (higher is better).
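As a rough illustration of the accuracy metric, the sketch below scores a list of agent answers against reference answers. The grading rule used here (case- and whitespace-insensitive exact match) and the names `normalize` and `accuracy` are assumptions for illustration only; the benchmark's own grader may judge correctness differently.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed, simplified matching rule)."""
    return " ".join(text.lower().split())


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of questions whose predicted answer matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(
        normalize(pred) == normalize(ref)
        for pred, ref in zip(predictions, references)
    )
    return correct / len(references)


if __name__ == "__main__":
    # Hypothetical answers, not taken from the benchmark.
    preds = ["Paris", "  new york "]
    refs = ["paris", "Boston"]
    print(accuracy(preds, refs))
```

A score closer to 1.0 means the agent answered more questions correctly.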
No run guide for this benchmark yet.