Arena-Hard-Auto v2.0
An automatic LLM benchmark of 500 hard real-world queries (plus 250 creative-writing prompts) sourced from Chatbot Arena, scored as a win rate against a baseline using LLM judges (GPT-4.1 and Gemini-2.5) as a cheap proxy for human preference.
What this benchmark measures
An automatic LLM benchmark of 500 hard real-world queries (plus 250 creative-writing prompts) sourced from Chatbot Arena, scored as a win rate against a baseline using LLM judges (GPT-4.1 and Gemini-2.5) as a cheap proxy for human preference.
Rows on this page are sourced from public benchmark artifacts, leaderboard exports, or source-linked model reports. Each row keeps benchmark version, source model name, and available run details attached to the score.
The metric shown here is % win rate. It should be interpreted within Arena-Hard-Auto v2.0, not compared as part of a site-wide ranking.