BrowseComp
A benchmark of 1,266 hard-to-find, multi-hop web-browsing questions whose answers are difficult to locate but easy to verify, measuring an agent's ability to persistently search and synthesize information from the web.
What this benchmark measures
BrowseComp contains 1,266 questions. Each one requires multi-hop web browsing: the answer is difficult to locate but easy to verify once found. The benchmark therefore measures how persistently an agent searches the web and how well it synthesizes the information it finds.
Rows on this page are sourced from public benchmark artifacts, leaderboard exports, or source-linked model reports. Each row keeps benchmark version, source model name, and available run details attached to the score.
The metric shown here is accuracy, the fraction of questions answered correctly. Interpret scores within BrowseComp only; they are not comparable across benchmarks as part of a site-wide ranking.
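As a rough illustration of how an accuracy figure is derived, the sketch below computes the fraction of correct answers from a list of already-graded questions. The `GradedAnswer` type, its field names, and the sample data are hypothetical and not part of BrowseComp; how each answer is judged correct is left out, since the benchmark's own grading procedure is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GradedAnswer:
    """One graded question. Field names are hypothetical, for illustration."""
    question_id: str
    is_correct: bool


def accuracy(graded: list[GradedAnswer]) -> float:
    """Return the fraction of questions answered correctly (0.0 to 1.0)."""
    if not graded:
        raise ValueError("cannot compute accuracy over zero questions")
    return sum(a.is_correct for a in graded) / len(graded)


# Illustrative data only: four made-up graded answers, three of them correct.
sample = [
    GradedAnswer("q1", True),
    GradedAnswer("q2", False),
    GradedAnswer("q3", True),
    GradedAnswer("q4", True),
]
print(f"{accuracy(sample):.1%}")  # prints "75.0%"
```

For a full BrowseComp run the list would hold 1,266 entries, one per question, so each question contributes equally to the reported score.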