EQ-Bench Creative Writing v3
An LLM-judged creative writing benchmark that scores models across 32 prompts (3 iterations each) using a hybrid of rubric scoring and pairwise Elo comparisons computed with a margin-weighted Glicko-2 rating system.
Chat preference · Elo · Higher is better
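To make the idea of margin-weighted pairwise ratings concrete, here is a minimal sketch. It is not the benchmark's implementation: it swaps in a plain Elo update for Glicko-2 (so it tracks no rating deviation or volatility), and the names `margin_to_outcome` and `update_ratings`, the `max_margin` scale, the K-factor, and the starting rating of 1200 are all assumptions chosen for illustration.

```python
from collections import defaultdict


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of A against B (logistic, 400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def margin_to_outcome(margin: float, max_margin: float) -> float:
    """Map a signed judge margin to an outcome in [0, 1].

    Hypothetical mapping: +max_margin is a decisive win for A (1.0),
    0 is a draw (0.5), and -max_margin is a decisive loss (0.0).
    Margins outside the range are clipped.
    """
    clipped = max(-max_margin, min(max_margin, margin))
    return 0.5 + 0.5 * clipped / max_margin


def update_ratings(comparisons, k=16.0, initial=1200.0, max_margin=5.0):
    """Apply one pass of margin-weighted Elo updates.

    `comparisons` is an iterable of (model_a, model_b, margin) tuples,
    where a positive margin means the judge preferred model_a.
    """
    ratings = defaultdict(lambda: initial)
    for model_a, model_b, margin in comparisons:
        e_a = expected_score(ratings[model_a], ratings[model_b])
        s_a = margin_to_outcome(margin, max_margin)
        delta = k * (s_a - e_a)  # larger margins move ratings further
        ratings[model_a] += delta
        ratings[model_b] -= delta
    return dict(ratings)


if __name__ == "__main__":
    # Hypothetical judge results, not real benchmark data.
    demo = [("model_x", "model_y", 5.0), ("model_y", "model_z", 1.0)]
    print(update_ratings(demo))
```

In this sketch a narrow win counts as only slightly more than a draw, so it shifts ratings less than a decisive one. A Glicko-2 system would additionally track how uncertain each rating is and scale updates accordingly.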
No run guide for this benchmark yet.