evals.report

Online-Mind2Web

A live web-agent benchmark of 300 realistic tasks across 136 real websites that measures whether an autonomous agent can complete end-to-end web tasks on dynamic, online pages, scored as task success rate.

Category: Agents. Metric: Task success rate (higher is better).
| Model | Lab | Score | Source model | Status | Date |
| --- | --- | --- | --- | --- | --- |
| GPT-5.4 | OpenAI | 92.8% | | Verified | Mar 5, 2026 |
| GPT-5 | OpenAI | 42.33% | | Verified | Aug 7, 2025 |
| Claude Sonnet 4 | Anthropic | 40.00% | | Verified | May 22, 2025 |
| Claude 3.7 Sonnet | Anthropic | 39.33% | | Verified | Feb 24, 2025 |
| o3 | OpenAI | 39.00% | | Verified | Apr 16, 2025 |
| GPT-4.1 | OpenAI | 36.33% | | Verified | Apr 14, 2025 |
| DeepSeek V3 | DeepSeek | 32.33% | | Verified | Dec 26, 2024 |
| o4-mini | OpenAI | 32.00% | | Verified | Apr 16, 2025 |
| GPT-4o | OpenAI | 30.7% | | Official | May 13, 2024 |
| Gemini 2.0 Flash | Google DeepMind | 29.00% | | Verified | Dec 11, 2024 |
| Claude 3.5 Sonnet | Anthropic | 29.0% | | Official | Jun 20, 2024 |
| DeepSeek R1 | DeepSeek | 25.33% | | Verified | Jan 20, 2025 |

Each row reports the model’s task success rate on Online-Mind2Web.
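
As a rough illustration of the metric, task success rate is the share of benchmark tasks judged successful, expressed as a percentage. The sketch below assumes each task outcome is available as a boolean; the function name `task_success_rate` and the sample counts are hypothetical, and the page does not say how an individual task is judged successful.

```python
def task_success_rate(outcomes):
    """Return the percentage of tasks marked successful.

    `outcomes` is an assumed representation: one boolean per task,
    True if the agent completed the task.
    """
    if not outcomes:
        raise ValueError("need at least one task outcome")
    successes = sum(1 for ok in outcomes if ok)
    return 100.0 * successes / len(outcomes)


# Hypothetical run over the benchmark's 300 tasks, 127 judged successful.
outcomes = [True] * 127 + [False] * (300 - 127)
print(f"{task_success_rate(outcomes):.2f}%")  # prints "42.33%"
```

With 300 tasks, each additional completed task moves the score by one third of a percentage point, which is why several scores in the table end in .33 or .00.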