evals.report

BrowseComp

A benchmark of 1,266 multi-hop web-browsing questions whose answers are hard to find but easy to verify. It measures an agent's ability to search persistently and synthesize information from the web.

Agents · accuracy · Higher is better
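As a rough illustration of the accuracy metric, the sketch below scores a list of agent answers against reference answers and returns the fraction that match. The normalized string comparison is an assumption made for this example only; this page does not specify how answers are graded, and the names `normalize` and `accuracy` are hypothetical.

```python
import re
import string


def normalize(answer: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    stripped = answer.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", stripped).strip()


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions matching their reference after normalization.

    Assumption: a simple normalized exact match stands in for the real grader.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return correct / len(references)


if __name__ == "__main__":
    # Two made-up answers: one matches its reference, one does not.
    print(accuracy(["Paris", "Berlin"], ["paris", "Vienna"]))
```

Because answers are easy to verify once found, a per-question correct/incorrect judgment averaged over all questions is enough to produce a single score where higher is better.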

No run guide for this benchmark yet.