evals.report

Arena-Hard-Auto v2.0

Arena-Hard-Auto v2.0 is an automatic LLM benchmark built from 500 hard real-world queries, plus 250 creative-writing prompts, sourced from Chatbot Arena. Each model is scored by its win rate against a baseline, with LLM judges (GPT-4.1 and Gemini-2.5) serving as a cheap proxy for human preference.
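To make "win rate against a baseline" concrete, here is a minimal Python sketch that turns per-prompt judge verdicts into a percentage. It is a simplification under stated assumptions, not the benchmark's actual scoring pipeline: the verdict labels ("win", "tie", "loss"), the function name win_rate, and the choice to count a tie as half a win are all illustrative.

```python
from collections import Counter

# Hypothetical verdict labels; the real judge output format may differ.
VALID_VERDICTS = {"win", "tie", "loss"}


def win_rate(verdicts):
    """Return the win rate in percent for a list of per-prompt verdicts.

    Each verdict describes how the evaluated model fared against the
    baseline on one prompt. Assumption: a tie counts as half a win.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    unknown = set(counts) - VALID_VERDICTS
    if unknown:
        raise ValueError(f"unknown verdict labels: {sorted(unknown)}")
    score = counts["win"] + 0.5 * counts["tie"]
    return 100.0 * score / len(verdicts)


# Two wins, one tie, one loss over four prompts.
example = ["win", "win", "tie", "loss"]
print(f"{win_rate(example):.1f}%")  # prints "62.5%"
```

Under this convention a model that ties the baseline on every prompt scores exactly 50%, so scores above 50% indicate that the judges preferred the model to the baseline more often than not.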

Category: Chat preference. Metric: % win rate. Higher is better.
Model              Lab              Score  Source model  Status    Date
o3                 OpenAI           85.9%                Official  Apr 16, 2025
Gemini 2.5 Pro     Google DeepMind  79.0%                Official  Mar 25, 2025
o4-mini            OpenAI           74.6%                Official  Apr 16, 2025
Gemini 2.5 Flash   Google DeepMind  68.6%                Official  Apr 17, 2025
Claude 3.7 Sonnet  Anthropic        59.8%                Official  Feb 24, 2025
DeepSeek R1        DeepSeek         58.0%                Official  Jan 20, 2025
GPT-4.1            OpenAI           50.0%                Official  Apr 14, 2025
Claude 3.5 Sonnet  Anthropic        33.0%                Official  Jun 20, 2024
Llama 4 Maverick   Meta             17.2%                Official  Apr 5, 2025

Each row reports the model's % win rate on Arena-Hard-Auto v2.0.