evals.report

Arena-Hard-Auto v2.0

An automatic LLM benchmark of 500 hard real-world queries (plus 250 creative-writing prompts) sourced from Chatbot Arena, scored as a win rate against a baseline using LLM judges (GPT-4.1 and Gemini-2.5) as a cheap proxy for human preference.
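As a rough illustration of what "win rate against a baseline" means, the sketch below turns per-prompt judge verdicts into a percentage. It is a simplification under stated assumptions, not the benchmark's actual scoring pipeline: the verdict labels ("win", "tie", "loss"), the `win_rate` function, and the choice to count a tie as half a win are all hypothetical and do not come from the page above.

```python
from collections import Counter

# Hypothetical verdict labels; the real judge output format may differ.
VALID_VERDICTS = {"win", "tie", "loss"}


def win_rate(verdicts):
    """Return the model's win rate against the baseline, in percent.

    verdicts: one label per prompt, taken from the judge's pairwise
    comparison of the model's answer with the baseline's answer.
    A tie counts as half a win (an assumption of this sketch).
    """
    verdicts = list(verdicts)
    if not verdicts:
        raise ValueError("no verdicts to score")
    unknown = set(verdicts) - VALID_VERDICTS
    if unknown:
        raise ValueError(f"unknown verdict labels: {sorted(unknown)}")
    counts = Counter(verdicts)
    return 100.0 * (counts["win"] + 0.5 * counts["tie"]) / len(verdicts)


# Four prompts: two wins, one loss, one tie.
print(win_rate(["win", "win", "loss", "tie"]))  # 62.5
```

Under these assumptions, a model that performs exactly like the baseline on every prompt would score 50, which is why scores are read relative to that midpoint.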

Category: Chat preference. Metric: % win rate. Higher is better.

No run guide for this benchmark yet.