evals.report

EQ-Bench Creative Writing v3

An LLM-judged creative writing benchmark that scores each model on 32 prompts with 3 iterations each (96 generations per model). Scoring is a hybrid: rubric scores, plus pairwise comparisons whose Elo ratings are computed with a margin-weighted Glicko-2 rating system.
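To make the idea of a margin-weighted pairwise rating concrete, here is a minimal sketch. It is a simplified stand-in, not the benchmark's actual method: it uses a plain Elo update rather than Glicko-2 (which also tracks rating deviation and volatility). The function names, the `margin` scale of -1 to 1, and the `k` factor are all hypothetical and chosen only for illustration.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def margin_weighted_update(r_a: float, r_b: float, margin: float, k: float = 32.0):
    """Update both ratings after one judged comparison.

    margin is a hypothetical value in [-1, 1]: positive means the judge
    preferred A, and the magnitude says how decisive the preference was.
    """
    if not -1.0 <= margin <= 1.0:
        raise ValueError("margin must be in [-1, 1]")
    # Map the margin to a fractional outcome in [0, 1]; 0.5 is a tie.
    outcome = 0.5 + 0.5 * margin
    delta = k * (outcome - expected_score(r_a, r_b))
    # The update is zero-sum: what A gains, B loses.
    return r_a + delta, r_b - delta


if __name__ == "__main__":
    # A narrow win moves ratings less than a decisive one.
    print(margin_weighted_update(1500.0, 1500.0, margin=0.2))
    print(margin_weighted_update(1500.0, 1500.0, margin=1.0))
```

The design point this illustrates is that a narrow judged preference shifts ratings less than a decisive one, which is the general intent of weighting by margin. How the benchmark actually derives and applies margins is not stated here.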

Chat preference · Elo · Higher is better

No run guide for this benchmark yet.