evals.report

Design Arena

A crowdsourced human-preference benchmark where top AI models receive identical design/frontend prompts and users vote head-to-head on the anonymized outputs, producing a Bradley-Terry (Elo) ranking of design taste across categories like websites, UI components, games, and data visualization.

Chat preference · Elo · Higher is better

What this benchmark measures

Design Arena measures human preference for model-generated design and frontend work. Each vote is a head-to-head comparison between two anonymized outputs produced from the same prompt. The votes are fit with a Bradley-Terry model and reported as an Elo rating, so a higher rating means voters preferred a model's output more often across categories such as websites, UI components, games, and data visualization.
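To make the ranking step concrete, here is a minimal sketch of how head-to-head votes can be turned into Bradley-Terry strengths and then mapped onto an Elo-style scale. It is illustrative only and is not Design Arena's actual pipeline: the function name `fit_bradley_terry`, the base rating of 1000, the 400-point scale, and the iteration count are all assumptions chosen for the example.

```python
import math
from collections import defaultdict


def fit_bradley_terry(votes, iterations=200, base=1000.0, scale=400.0):
    """Fit Bradley-Terry strengths from (winner, loser) votes.

    Uses the classic minorization-maximization update. `base` and `scale`
    are assumed display conventions, not values taken from Design Arena.
    Returns a dict mapping model name to an Elo-style rating.
    """
    models = sorted({m for pair in votes for m in pair})
    wins = defaultdict(int)   # total wins per model
    games = defaultdict(int)  # comparisons per unordered pair of models
    for winner, loser in votes:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1

    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            denom = sum(
                games[frozenset((m, o))] / (strength[m] + strength[o])
                for o in models
                if o != m
            )
            # Floor the strength so a model with zero wins stays finite.
            updated[m] = max(wins[m] / denom, 1e-9) if denom > 0 else strength[m]
        # Rescale so the geometric mean of strengths is 1 (fixes the free scale).
        log_mean = sum(math.log(s) for s in updated.values()) / len(updated)
        strength = {m: s / math.exp(log_mean) for m, s in updated.items()}

    # Map strengths onto a logarithmic, Elo-style scale.
    return {m: base + scale * math.log10(s) for m, s in strength.items()}


# Hypothetical votes: model "A" was preferred over "B" three times out of four.
ratings = fit_bradley_terry([("A", "B")] * 3 + [("B", "A")])
```

Only rating differences carry meaning in this sketch. The base value just shifts every model by the same amount, which is one reason a rating from one leaderboard cannot be read against another.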

Rows on this page are sourced from public benchmark artifacts, leaderboard exports, or source-linked model reports. Each row keeps benchmark version, source model name, and available run details attached to the score.

The metric shown here is Elo. It should be interpreted within Design Arena, not compared as part of a site-wide ranking.
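One way to read a rating gap within a single leaderboard is as a predicted preference rate between two models. The sketch below uses the conventional Elo logistic formula with a 400-point scale. That scale is an assumption here and may not match how Design Arena calibrates its ratings. The function name `expected_win_rate` is hypothetical.

```python
def expected_win_rate(rating_a, rating_b, scale=400.0):
    """Predicted share of votes for model A over model B.

    Conventional Elo formula; `scale` is an assumed convention, not a
    value confirmed for Design Arena.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Equal ratings predict an even split; a 100-point lead predicts roughly 64%.
even = expected_win_rate(1200, 1200)
lead = expected_win_rate(1300, 1200)
```

The prediction only applies to pairs of models rated from the same pool of votes, which is why an Elo value from this page should not be set against a score from another benchmark.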

No composite ranking
evals.report never combines benchmarks. Elo on Design Arena is its own number, so don't average it with other metrics.