EQ-Bench Creative Writing v3
An LLM-judged creative writing benchmark that scores models across 32 prompts (3 iterations each) using a hybrid of rubric scoring and pairwise Elo comparisons computed with a margin-weighted Glicko-2 rating system.
Chat preference · Elo · Higher is better
| Model | Lab | Score (Elo) | Source model | Status | Date | |
|---|---|---|---|---|---|---|
| Claude Opus 4.7 | Anthropic | 2206 | — | Verified | Apr 16, 2026 | Details |
| GPT-5.5 | OpenAI | 2035 | — | Verified | Apr 23, 2026 | Details |
| Claude Opus 4.8 | Anthropic | 2031 | — | Verified | May 28, 2026 | Details |
| GPT-5.4 | OpenAI | 2003 | — | Verified | Mar 5, 2026 | Details |
| Claude Sonnet 4.6 | Anthropic | 1968 | — | Verified | Feb 17, 2026 | Details |
| Claude Opus 4.6 | Anthropic | 1933 | — | Verified | Feb 5, 2026 | Details |
| GPT-5.2 | OpenAI | 1783 | — | Verified | Dec 11, 2025 | Details |
| Kimi K2.6 | Moonshot AI | 1782 | — | Verified | Apr 20, 2026 | Details |
| Claude Sonnet 4.5 | Anthropic | 1767 | — | Verified | Sep 29, 2025 | Details |
| Claude Opus 4.5 | Anthropic | 1762 | — | Verified | Nov 24, 2025 | Details |
| o3 | OpenAI | 1744 | — | Verified | Apr 16, 2025 | Details |
| Kimi K2 Instruct | Moonshot AI | 1738 | — | Verified | Jul 11, 2025 | Details |
| Kimi K2 Thinking | Moonshot AI | 1695 | — | Verified | Nov 6, 2025 | Details |
| Grok 4.20 beta reasoning | xAI | 1675 | — | Verified | Mar 9, 2026 | Details |
| GLM-5 | Z.ai | 1658 | — | Verified | Feb 11, 2026 | Details |
| GLM-5.1 | Z.ai | 1645 | — | Verified | Apr 7, 2026 | Details |
| GPT-5 | OpenAI | 1640 | — | Verified | Aug 7, 2025 | Details |
| Claude Opus 4 | Anthropic | 1639 | — | Verified | May 22, 2025 | Details |
| Kimi K2.5 | Moonshot AI | 1593 | — | Verified | Jan 27, 2026 | Details |
| DeepSeek V4 Pro | DeepSeek | 1570 | — | Verified | Apr 24, 2026 | Details |
| DeepSeek V4 Flash | DeepSeek | 1556 | — | Verified | Apr 24, 2026 | Details |
| DeepSeek V3.2 | DeepSeek | 1515 | — | Verified | Dec 1, 2025 | Details |
| Claude Sonnet 4 | Anthropic | 1514 | — | Verified | May 22, 2025 | Details |
| Gemini 3 Pro | Google DeepMind | 1504 | — | Verified | Nov 18, 2025 | Details |
| DeepSeek R1 | DeepSeek | 1500 | — | Verified | Jan 20, 2025 | Details |
| GPT-4o | OpenAI | 1484 | — | Verified | May 13, 2024 | Details |
| Gemini 3.1 Pro Preview | Google DeepMind | 1479 | — | Verified | Feb 19, 2026 | Details |
| DeepSeek V3 0324 | DeepSeek | 1474 | — | Verified | Mar 24, 2025 | Details |
| Qwen3.5-397B-A17B | Alibaba / Qwen | 1469 | — | Verified | Feb 16, 2026 | Details |
| Claude 3.5 Sonnet | Anthropic | 1448 | — | Verified | Jun 20, 2024 | Details |
| DeepSeek V3.1 | DeepSeek | 1420 | — | Verified | Aug 21, 2025 | Details |
| GPT-4.1 | OpenAI | 1419 | — | Verified | Apr 14, 2025 | Details |
| Gemini 2.5 Pro | Google DeepMind | 1417 | — | Verified | Mar 25, 2025 | Details |
| GLM-4.7 | Z.ai | 1403 | — | Verified | Dec 22, 2025 | Details |
| Mistral Large | Mistral AI | 1402 | — | Verified | Feb 26, 2024 | Details |
| Claude 3.7 Sonnet | Anthropic | 1395 | — | Verified | Feb 24, 2025 | Details |
| GLM-4.6 | Z.ai | 1393 | — | Verified | Sep 30, 2025 | Details |
| MiniMax M2.5 | MiniMax | 1331 | — | Verified | Feb 12, 2026 | Details |
| Grok 4.1 fast reasoning | xAI | 1317 | — | Verified | Nov 19, 2025 | Details |
| GPT-5 mini | OpenAI | 1298 | — | Verified | Aug 7, 2025 | Details |
| Reka Flash 3 | Reka AI | 1250 | — | Verified | Mar 10, 2025 | Details |
| Gemini 2.5 Flash | Google DeepMind | 1243 | — | Verified | Apr 17, 2025 | Details |
| Gemini 2.0 Flash | Google DeepMind | 1239 | — | Verified | Dec 11, 2024 | Details |
| GPT-OSS-120B | OpenAI | 1041 | — | Verified | Aug 5, 2025 | Details |
| Llama 3.1 405B | Meta | 953 | — | Verified | Jul 23, 2024 | Details |
| Llama 4 Maverick | Meta | 927 | — | Verified | Apr 5, 2025 | Details |
| Llama 4 Scout | Meta | 883 | — | Verified | Apr 5, 2025 | Details |
Each row reports the model’s Elo on EQ-Bench Creative Writing v3.
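To get a rough feel for what a gap between two scores means, the standard Elo expected-score formula can be applied to a pair of ratings from the table. This is only a sketch: it assumes the published scores sit on the conventional 400-point logistic Elo scale, and it ignores the rating deviation and margin weighting that the benchmark's Glicko-2 procedure uses, so the result is an approximation rather than the benchmark's own calculation. The function name `expected_score` is illustrative and not part of any EQ-Bench tooling.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B.

    Assumes the conventional logistic curve with a 400-point scale.
    This is NOT the benchmark's margin-weighted Glicko-2 update; it is
    only a common way to read a rating difference as a win probability.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


# Ratings copied from the table above.
claude_opus_4_7 = 2206
gpt_5_5 = 2035

# Approximate chance that the higher-rated model is preferred
# in a single pairwise comparison, under the assumptions above.
p = expected_score(claude_opus_4_7, gpt_5_5)
print(f"Expected score for a {claude_opus_4_7 - gpt_5_5}-point gap: {p:.3f}")
```

Under these assumptions, equal ratings give an expected score of 0.5, and the two models' expected scores against each other always sum to 1.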