evals.report

EQ-Bench Creative Writing v3

An LLM-judged creative writing benchmark that scores models across 32 prompts (3 iterations each) using a hybrid of rubric scoring and pairwise Elo comparisons computed with a margin-weighted Glicko-2 rating system.

Chat preference · Elo · Higher is better
| Model | Lab | Score | Source model | Status | Date |
| --- | --- | --- | --- | --- | --- |
| Claude Opus 4.7 | Anthropic | 2206 | | Verified | Apr 16, 2026 |
| GPT-5.5 | OpenAI | 2035 | | Verified | Apr 23, 2026 |
| Claude Opus 4.8 | Anthropic | 2031 | | Verified | May 28, 2026 |
| GPT-5.4 | OpenAI | 2003 | | Verified | Mar 5, 2026 |
| Claude Sonnet 4.6 | Anthropic | 1968 | | Verified | Feb 17, 2026 |
| Claude Opus 4.6 | Anthropic | 1933 | | Verified | Feb 5, 2026 |
| GPT-5.2 | OpenAI | 1783 | | Verified | Dec 11, 2025 |
| Kimi K2.6 | Moonshot AI | 1782 | | Verified | Apr 20, 2026 |
| Claude Sonnet 4.5 | Anthropic | 1767 | | Verified | Sep 29, 2025 |
| Claude Opus 4.5 | Anthropic | 1762 | | Verified | Nov 24, 2025 |
| o3 | OpenAI | 1744 | | Verified | Apr 16, 2025 |
| Kimi K2 Instruct | Moonshot AI | 1738 | | Verified | Jul 11, 2025 |
| Kimi K2 Thinking | Moonshot AI | 1695 | | Verified | Nov 6, 2025 |
| Grok 4.20 beta reasoning | xAI | 1675 | | Verified | Mar 9, 2026 |
| GLM-5 | Z.ai | 1658 | | Verified | Feb 11, 2026 |
| GLM-5.1 | Z.ai | 1645 | | Verified | Apr 7, 2026 |
| GPT-5 | OpenAI | 1640 | | Verified | Aug 7, 2025 |
| Claude Opus 4 | Anthropic | 1639 | | Verified | May 22, 2025 |
| Kimi K2.5 | Moonshot AI | 1593 | | Verified | Jan 27, 2026 |
| DeepSeek V4 Pro | DeepSeek | 1570 | | Verified | Apr 24, 2026 |
| DeepSeek V4 Flash | DeepSeek | 1556 | | Verified | Apr 24, 2026 |
| DeepSeek V3.2 | DeepSeek | 1515 | | Verified | Dec 1, 2025 |
| Claude Sonnet 4 | Anthropic | 1514 | | Verified | May 22, 2025 |
| Gemini 3 Pro | Google DeepMind | 1504 | | Verified | Nov 18, 2025 |
| DeepSeek R1 | DeepSeek | 1500 | | Verified | Jan 20, 2025 |
| GPT-4o | OpenAI | 1484 | | Verified | May 13, 2024 |
| Gemini 3.1 Pro Preview | Google DeepMind | 1479 | | Verified | Feb 19, 2026 |
| DeepSeek V3 0324 | DeepSeek | 1474 | | Verified | Mar 24, 2025 |
| Qwen3.5-397B-A17B | Alibaba / Qwen | 1469 | | Verified | Feb 16, 2026 |
| Claude 3.5 Sonnet | Anthropic | 1448 | | Verified | Jun 20, 2024 |
| DeepSeek V3.1 | DeepSeek | 1420 | | Verified | Aug 21, 2025 |
| GPT-4.1 | OpenAI | 1419 | | Verified | Apr 14, 2025 |
| Gemini 2.5 Pro | Google DeepMind | 1417 | | Verified | Mar 25, 2025 |
| GLM-4.7 | Z.ai | 1403 | | Verified | Dec 22, 2025 |
| Mistral Large | Mistral AI | 1402 | | Verified | Feb 26, 2024 |
| Claude 3.7 Sonnet | Anthropic | 1395 | | Verified | Feb 24, 2025 |
| GLM-4.6 | Z.ai | 1393 | | Verified | Sep 30, 2025 |
| MiniMax M2.5 | MiniMax | 1331 | | Verified | Feb 12, 2026 |
| Grok 4.1 fast reasoning | xAI | 1317 | | Verified | Nov 19, 2025 |
| GPT-5 mini | OpenAI | 1298 | | Verified | Aug 7, 2025 |
| Reka Flash 3 | Reka AI | 1250 | | Verified | Mar 10, 2025 |
| Gemini 2.5 Flash | Google DeepMind | 1243 | | Verified | Apr 17, 2025 |
| Gemini 2.0 Flash | Google DeepMind | 1239 | | Verified | Dec 11, 2024 |
| GPT-OSS-120B | OpenAI | 1041 | | Verified | Aug 5, 2025 |
| Llama 3.1 405B | Meta | 953 | | Verified | Jul 23, 2024 |
| Llama 4 Maverick | Meta | 927 | | Verified | Apr 5, 2025 |
| Llama 4 Scout | Meta | 883 | | Verified | Apr 5, 2025 |

Each row reports the model’s Elo on EQ-Bench Creative Writing v3.
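
To get a rough feel for what a gap between two scores means, the sketch below converts a rating difference into an expected pairwise score with the textbook logistic Elo formula. This is an illustration under assumptions, not the benchmark's own computation: it assumes the conventional 400-point scale, and it ignores the rating deviations and margin weighting that the Glicko-2 variant described above would take into account. The function name `expected_score` and the `scale` parameter are hypothetical names chosen for this example.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard logistic Elo model.

    Assumption: the conventional 400-point Elo scale. The leaderboard does not
    state its scale, so treat the result as a rough guide only.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Two scores copied from the leaderboard above.
claude_opus_4_7 = 2206
gpt_5_5 = 2035

p = expected_score(claude_opus_4_7, gpt_5_5)
print(f"Expected pairwise score for the higher-rated model: {p:.2f}")
```

Under these assumptions, equal ratings give an expected score of 0.5, and the two expected scores for any pair sum to 1.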