Question 1

What is Online-Mind2Web?

Accepted Answer

Online-Mind2Web is a live web-agent benchmark of 300 realistic tasks across 136 real websites. It measures whether an autonomous agent can complete end-to-end web tasks on dynamic, online pages, and results are reported as task success rate.

Question 2

What does Task success rate mean on Online-Mind2Web?

Accepted Answer

Online-Mind2Web reports task success rate (%): the percentage of the benchmark's tasks that the agent completes successfully. Higher is better. Scores are shown only within Online-Mind2Web and are never averaged with other benchmarks.
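
As a rough illustration of the metric, the sketch below computes a success rate from per-task pass/fail outcomes. The function name `task_success_rate` and the input format are hypothetical, and it assumes each task is judged as a simple success or failure; the benchmark's actual judging procedure is not described here.

```python
from typing import Sequence


def task_success_rate(outcomes: Sequence[bool]) -> float:
    """Return the percentage of tasks marked successful.

    `outcomes` holds one boolean per attempted task (hypothetical input format).
    """
    if not outcomes:
        raise ValueError("at least one task outcome is required")
    # True counts as 1 and False as 0, so sum() gives the number of successes.
    return 100.0 * sum(outcomes) / len(outcomes)


# Illustrative data only: 127 successes out of 300 tasks.
example = [True] * 127 + [False] * 173
print(f"{task_success_rate(example):.2f}%")  # prints 42.33%
```

With 300 tasks, each task moves the score by one third of a percentage point, so small score gaps may correspond to only a handful of tasks.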

Question 3

What is the top reported Online-Mind2Web score?

Accepted Answer

GPT-5.4 has the top reported score on Online-Mind2Web: 92.8% (Task success rate).

Question 4

Why do Online-Mind2Web scores differ across runs?

Accepted Answer

Harness, scaffold, reasoning effort, and prompt setup change results, so two runs of the same model can differ. evals.report keeps each score with its run context so the differences stay visible.
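
One way to picture "a score with its run context" is a record that stores the setup fields next to the number. This is a minimal sketch under assumed names: the `RunResult` schema, its field names, and `same_context` are hypothetical and are not evals.report's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """One reported score plus the setup that produced it (hypothetical schema)."""
    model: str
    benchmark: str
    score_pct: float
    harness: str
    scaffold: str
    reasoning_effort: str
    prompt_setup: str


def same_context(a: RunResult, b: RunResult) -> bool:
    """True only when two runs share the benchmark and every run-context field."""
    key_a = (a.benchmark, a.harness, a.scaffold, a.reasoning_effort, a.prompt_setup)
    key_b = (b.benchmark, b.harness, b.scaffold, b.reasoning_effort, b.prompt_setup)
    return key_a == key_b


# Made-up values for illustration; these are not real results.
run_a = RunResult("model-x", "Online-Mind2Web", 40.0, "harness-1", "scaffold-a", "high", "default")
run_b = RunResult("model-x", "Online-Mind2Web", 36.0, "harness-2", "scaffold-a", "high", "default")
print(same_context(run_a, run_b))  # False: the harness differs
```

Comparing two scores directly is only meaningful when a check like `same_context` passes; otherwise the gap may come from the setup rather than the model.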

Question 5

Does evals.report rank models across benchmarks?

Accepted Answer

No. Online-Mind2Web scores are shown within their own metric; evals.report never combines benchmarks into a composite ranking or a single "best model".
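
The "no composite ranking" rule can be sketched as grouping scores by benchmark and ordering models only inside each group. The record shape and the function name `per_benchmark_rankings` are assumptions for illustration, and the sketch assumes higher is better for every benchmark it is given.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Score = Tuple[str, str, float]  # (benchmark, model, score); hypothetical record shape


def per_benchmark_rankings(scores: List[Score]) -> Dict[str, List[Tuple[str, float]]]:
    """Rank models inside each benchmark; scores are never merged across benchmarks."""
    grouped: Dict[str, List[Tuple[str, float]]] = defaultdict(list)
    for benchmark, model, score in scores:
        grouped[benchmark].append((model, score))
    # Assumes higher is better, so each benchmark's list is sorted descending.
    return {b: sorted(rows, key=lambda r: r[1], reverse=True) for b, rows in grouped.items()}


# Made-up values for illustration; these are not real results.
rankings = per_benchmark_rankings([
    ("benchmark-a", "model-x", 10.0),
    ("benchmark-a", "model-y", 20.0),
    ("benchmark-b", "model-x", 5.0),
])
print(rankings["benchmark-a"][0][0])  # model-y
```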

Model	Lab	Score (task success rate)	Source model	Status	Date
GPT-5.4	OpenAI	92.8%	-	Verified	Mar 5, 2026
GPT-5	OpenAI	42.33%	-	Verified	Aug 7, 2025
Claude Sonnet 4	Anthropic	40.00%	-	Verified	May 22, 2025
Claude 3.7 Sonnet	Anthropic	39.33%	-	Verified	Feb 24, 2025
o3	OpenAI	39.00%	-	Verified	Apr 16, 2025
GPT-4.1	OpenAI	36.33%	-	Verified	Apr 14, 2025
DeepSeek V3 (Open)	DeepSeek	32.33%	-	Verified	Dec 26, 2024
o4-mini	OpenAI	32.00%	-	Verified	Apr 16, 2025
GPT-4o	OpenAI	30.7%	-	Official	May 13, 2024
Gemini 2.0 Flash	Google DeepMind	29.00%	-	Verified	Dec 11, 2024
Claude 3.5 Sonnet	Anthropic	29.0%	-	Official	Jun 20, 2024
DeepSeek R1 (Open)	DeepSeek	25.33%	-	Verified	Jan 20, 2025