Arena-Hard-Auto v2.0
An automatic LLM benchmark of 500 hard real-world queries (plus 250 creative-writing prompts) sourced from Chatbot Arena. Each model is scored by its win rate against a baseline, with LLM judges (GPT-4.1 and Gemini-2.5) standing in as a cheap proxy for human preference.
Category: Chat preference. Metric: % win rate. Higher is better.
| Model | Lab | Score (% win rate) | Source model | Status | Date |
|---|---|---|---|---|---|
| o3 | OpenAI | 85.9% | — | Official | Apr 16, 2025 |
| Gemini 2.5 Pro | Google DeepMind | 79.0% | — | Official | Mar 25, 2025 |
| o4-mini | OpenAI | 74.6% | — | Official | Apr 16, 2025 |
| Gemini 2.5 Flash | Google DeepMind | 68.6% | — | Official | Apr 17, 2025 |
| Claude 3.7 Sonnet | Anthropic | 59.8% | — | Official | Feb 24, 2025 |
| DeepSeek R1 | DeepSeek | 58.0% | — | Official | Jan 20, 2025 |
| GPT-4.1 | OpenAI | 50.0% | — | Official | Apr 14, 2025 |
| Claude 3.5 Sonnet | Anthropic | 33.0% | — | Official | Jun 20, 2024 |
| Llama 4 Maverick | Meta | 17.2% | — | Official | Apr 5, 2025 |
Each row reports the model’s % win rate on Arena-Hard-Auto v2.0, sorted from highest to lowest.
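
To make the "% win rate" metric concrete, here is a minimal Python sketch of how a win rate against a baseline could be computed from per-prompt judge verdicts. The function name `win_rate`, the verdict labels, and the rule that a tie counts as half a win are assumptions made for illustration; the benchmark's actual aggregation may differ and is not described on this page.

```python
from collections import Counter

VALID_VERDICTS = {"win", "tie", "loss"}


def win_rate(verdicts):
    """Return the candidate model's win rate against a baseline, in percent.

    `verdicts` holds one label per prompt ("win", "tie", or "loss") from the
    candidate's point of view. Counting a tie as half a win is an assumption
    of this sketch, not a documented rule of the benchmark.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    unknown = set(counts) - VALID_VERDICTS
    if unknown:
        raise ValueError(f"unknown verdict labels: {sorted(unknown)}")
    score = counts["win"] + 0.5 * counts["tie"]
    return 100.0 * score / len(verdicts)


# Hypothetical verdicts for 10 prompts: 6 wins, 2 ties, 2 losses.
example = ["win"] * 6 + ["tie"] * 2 + ["loss"] * 2
print(f"{win_rate(example):.1f}%")  # prints "70.0%"
```

Under this convention, a model that is judged exactly as good as the baseline on every prompt would score 50%, which is why scores above 50% read as "preferred over the baseline more often than not".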