Arena-Hard-Auto v2.0
An automatic LLM benchmark of 500 hard real-world queries (plus 250 creative-writing prompts) sourced from Chatbot Arena. Each model is scored by its win rate against a baseline model, with LLM judges (GPT-4.1 and Gemini-2.5) standing in as a cheap proxy for human preference.
Category: Chat preference | Metric: % win rate | Higher is better
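To make the "% win rate" metric concrete, here is a minimal sketch of how a win rate against a baseline could be computed from per-prompt judge verdicts. It is an illustration only, not the benchmark's actual scoring pipeline: the `win_rate` function, the three verdict labels, and the choice to count a tie as half a win are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical verdict labels; the real judges' output format may differ.
VALID_VERDICTS = {"win", "tie", "loss"}


def win_rate(verdicts):
    """Return the model's win rate against the baseline, as a percentage.

    verdicts: iterable of "win", "tie", or "loss", one per judged prompt.
    A tie counts as half a win (an assumption of this sketch).
    """
    counts = Counter(verdicts)
    unknown = set(counts) - VALID_VERDICTS
    if unknown:
        raise ValueError(f"unknown verdict labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no verdicts to score")
    return 100.0 * (counts["win"] + 0.5 * counts["tie"]) / total


# Example: two wins, one tie, and one loss over four prompts.
example_score = win_rate(["win", "loss", "tie", "win"])
print(f"{example_score:.1f}% win rate")
```

Under this convention, a model that ties the baseline on every prompt scores 50%, so scores above 50% indicate that the judges prefer the model to the baseline more often than not.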
No run guide for this benchmark yet.