evals.report

Arena-Hard-Auto v2.0

An automatic LLM benchmark of 500 hard real-world queries (plus 250 creative-writing prompts) sourced from Chatbot Arena, scored as a win rate against a baseline using LLM judges (GPT-4.1 and Gemini-2.5) as a cheap proxy for human preference.

Chat preference · % win rate · Higher is better

What this benchmark measures

Each model answers the same set of prompts, and the LLM judges compare each answer with the baseline model's answer to that prompt. The reported score is the percentage of those comparisons the model wins, so it is always relative to the chosen baseline and judges.

Rows on this page are sourced from public benchmark artifacts, leaderboard exports, or source-linked model reports. Each row keeps benchmark version, source model name, and available run details attached to the score.

The metric shown here is % win rate. It should be interpreted within Arena-Hard-Auto v2.0, not compared as part of a site-wide ranking.
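As a rough illustration of what a % win rate against a baseline means, the sketch below turns a list of per-prompt judge verdicts into a single percentage. The function name `win_rate_percent`, the verdict labels, and the choice to count a tie as half a win are all assumptions made for this example; the benchmark's own scoring pipeline may aggregate judge outputs differently.

```python
from collections import Counter

ALLOWED = {"win", "loss", "tie"}


def win_rate_percent(verdicts):
    """Return the % win rate of a model against the baseline.

    `verdicts` holds one label per judged prompt: "win", "loss", or "tie".
    Counting a tie as half a win is an assumption of this sketch, not a
    documented rule of the benchmark.
    """
    counts = Counter(verdicts)
    unknown = set(counts) - ALLOWED
    if unknown:
        raise ValueError(f"unexpected verdict labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no verdicts to score")
    return 100.0 * (counts["win"] + 0.5 * counts["tie"]) / total


# Hypothetical verdicts for four prompts (illustrative data only).
verdicts = ["win", "win", "loss", "tie"]
print(f"{win_rate_percent(verdicts):.1f}% win rate")  # prints "62.5% win rate"
```

Because every verdict is a comparison with one fixed baseline, the resulting number only has meaning inside this benchmark and for that baseline.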

No composite ranking
evals.report never combines benchmarks. % win rate on Arena-Hard-Auto v2.0 is its own number; don't average it with other metrics.