evals.report

BrowseComp

A benchmark of 1,266 hard-to-find, multi-hop web-browsing questions whose answers are difficult to locate but easy to verify, measuring an agent's ability to persistently search and synthesize information from the web.

Agents · accuracy · Higher is better
| Model | Lab | Score | Source model | Status | Date |
| --- | --- | --- | --- | --- | --- |
| Kimi K2 Thinking | Moonshot AI | 60.2% | | Verified | Nov 6, 2025 |
| GPT-5 | OpenAI | 54.9% | | Verified | Aug 7, 2025 |
| o3 | OpenAI | 49.7% | | Verified | Apr 16, 2025 |
| DeepSeek V3.2 | DeepSeek | 40.1% | | Unverified | Dec 1, 2025 |
| o4-mini | OpenAI | 28.3% | | Verified | Apr 16, 2025 |
| Claude Sonnet 4.5 | Anthropic | 24.1% | | Unverified | Sep 29, 2025 |
| GPT-4o | OpenAI | 0.6% | | Verified | May 13, 2024 |

Each row reports the model’s accuracy on BrowseComp.
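To give the percentages a concrete scale, the sketch below converts each listed score into an approximate count of correctly answered questions. It assumes that accuracy is the fraction of questions answered correctly and that every run covered all 1,266 questions; neither is confirmed on this page, and some runs may have used a subset. The helper name `approx_correct` is hypothetical.

```python
# Benchmark size as stated in the description above.
TOTAL_QUESTIONS = 1266

# Percent accuracy for each model, copied from the scores listed above.
scores = {
    "Kimi K2 Thinking": 60.2,
    "GPT-5": 54.9,
    "o3": 49.7,
    "DeepSeek V3.2": 40.1,
    "o4-mini": 28.3,
    "Claude Sonnet 4.5": 24.1,
    "GPT-4o": 0.6,
}


def approx_correct(score_pct, total=TOTAL_QUESTIONS):
    """Approximate number of correct answers implied by a percent score.

    Assumption: the score was computed over `total` questions.
    """
    return round(score_pct / 100 * total)


for model, pct in scores.items():
    print(f"{model}: {pct}% is roughly {approx_correct(pct)} of {TOTAL_QUESTIONS}")
```

Under these assumptions, the top score of 60.2% corresponds to roughly 762 questions and the lowest, 0.6%, to roughly 8.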