BrowseComp
A benchmark of 1,266 hard-to-find, multi-hop web-browsing questions whose answers are difficult to locate but easy to verify, measuring an agent's ability to persistently search and synthesize information from the web.
Category: Agents. Metric: accuracy (higher is better).
| Model | Lab | Score (accuracy) | Source model | Status | Date |
|---|---|---|---|---|---|
| Kimi K2 Thinking | Moonshot AI | 60.2% | — | Verified | Nov 6, 2025 |
| GPT-5 | OpenAI | 54.9% | — | Verified | Aug 7, 2025 |
| o3 | OpenAI | 49.7% | — | Verified | Apr 16, 2025 |
| DeepSeek V3.2 | DeepSeek | 40.1% | — | Unverified | Dec 1, 2025 |
| o4-mini | OpenAI | 28.3% | — | Verified | Apr 16, 2025 |
| Claude Sonnet 4.5 | Anthropic | 24.1% | — | Unverified | Sep 29, 2025 |
| GPT-4o | OpenAI | 0.6% | — | Verified | May 13, 2024 |
Each row reports the model's accuracy on BrowseComp.
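To give a feel for what these percentages mean in absolute terms, the sketch below converts an accuracy score into an approximate count of correctly answered questions. It assumes each score was computed over the full set of 1,266 questions, which this page does not confirm; the function name `approx_correct` and the choice of three sample rows are illustrative only.

```python
# Rough sketch: convert a reported accuracy into an estimated number of
# correct answers. ASSUMPTION: the score covers all 1,266 questions;
# a run on a subset would give different counts.
TOTAL_QUESTIONS = 1266  # benchmark size stated in the description


def approx_correct(accuracy_percent: float, total: int = TOTAL_QUESTIONS) -> int:
    """Estimate how many questions were answered correctly."""
    return round(accuracy_percent / 100 * total)


# Sample rows copied from the table above.
scores = {"Kimi K2 Thinking": 60.2, "GPT-5": 54.9, "GPT-4o": 0.6}
estimates = {model: approx_correct(pct) for model, pct in scores.items()}

for model, count in estimates.items():
    print(f"{model}: about {count} of {TOTAL_QUESTIONS} questions")
```

Because the published scores are rounded to one decimal place, these counts are estimates and may be off by a question in either direction.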