Question 1

What is Online-Mind2Web?

Accepted Answer

Online-Mind2Web is a live web-agent benchmark of 300 realistic tasks across 136 real websites. It measures whether an autonomous agent can complete end-to-end web tasks on dynamic, online pages, and it is scored by task success rate.

Question 2

What does Task success rate mean on Online-Mind2Web?

Accepted Answer

Online-Mind2Web reports task success rate (%), the percentage of tasks the agent completes successfully; higher is better. Scores are shown only within Online-Mind2Web and are never averaged with other benchmarks.
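
As a rough illustration of the metric, the sketch below computes a task success rate from per-task outcomes. It assumes each task has already been judged a success or a failure; how that judgment is made is outside this sketch, and the function name and the sample outcomes are hypothetical, not taken from the benchmark.

```python
def task_success_rate(outcomes):
    """Return the percentage of tasks judged successful.

    `outcomes` is a list of booleans, one per task (hypothetical input shape).
    """
    if not outcomes:
        raise ValueError("no task outcomes to score")
    successes = sum(1 for ok in outcomes if ok)
    return 100.0 * successes / len(outcomes)


# Made-up example: 300 tasks, 150 of them judged successful.
outcomes = [True] * 150 + [False] * 150
print(f"{task_success_rate(outcomes):.1f}%")  # prints "50.0%"
```

Because the denominator is the full task set, a task that is skipped or errors out would normally count as a failure rather than being dropped; that handling is an assumption here, not something the source states.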

Question 3

What is the top reported Online-Mind2Web score?

Accepted Answer

GPT-5.4 has the top reported score on Online-Mind2Web: 92.8% (Task success rate).

Question 4

Why do Online-Mind2Web scores differ across runs?

Accepted Answer

Results depend on the harness, scaffold, reasoning effort, and prompt setup, so two runs of the same model can differ. evals.report keeps each score with its run context so the differences stay visible.
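
One way to picture "a score kept with its run context" is a record that stores the four factors named above next to the number. This is only a sketch under assumptions: the `RunResult` type, its field names, and every value in the example are hypothetical placeholders, not evals.report's actual schema or any real result.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """One reported score plus the context it was produced in (hypothetical schema)."""
    model: str
    benchmark: str
    score: float            # task success rate, in percent
    harness: str
    scaffold: str
    reasoning_effort: str
    prompt_setup: str


def group_by_model(results):
    """Group runs by model without merging or averaging their scores."""
    grouped = defaultdict(list)
    for result in results:
        grouped[result.model].append(result)
    return dict(grouped)


# Placeholder data only: these are not real models or real scores.
runs = [
    RunResult("model-a", "Online-Mind2Web", 61.0, "harness-x", "scaffold-1", "high", "default"),
    RunResult("model-a", "Online-Mind2Web", 54.5, "harness-y", "scaffold-1", "low", "default"),
]

for run in group_by_model(runs)["model-a"]:
    print(f"{run.model}: {run.score}% ({run.harness}, {run.scaffold}, effort={run.reasoning_effort})")
```

Keeping both rows, rather than collapsing them into one number per model, is what lets a reader see that the gap comes from the setup and not from the model alone.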

Question 5

Does evals.report rank models across benchmarks?

Accepted Answer

No. Online-Mind2Web scores are shown within their own metric; evals.report never combines benchmarks into a composite ranking or a single "best model".
