evals.report

LMArena

A public chat-preference evaluation surface with source-defined preference ratings and model comparisons.

Chat preference | source-defined rating | Higher is better

What this benchmark measures

LMArena measures chat preference. Its scores are preference ratings as defined by the source, published alongside model comparisons.

Rows on this page are sourced from public benchmark artifacts, leaderboard exports, or source-linked model reports. Each row keeps benchmark version, source model name, and available run details attached to the score.

The metric shown here is the source-defined rating. Interpret it only within LMArena; it is not part of any site-wide ranking.
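As a rough illustration of how a row can keep its provenance attached to the score, here is a minimal Python sketch. The `ScoreRow` class, its field names, and the `sort_within_benchmark` helper are all hypothetical and are not taken from evals.report or LMArena. The model names and rating values are made-up placeholders, not real results.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScoreRow:
    """One score row. All field names are hypothetical, for illustration only."""
    benchmark: str                     # e.g. "LMArena"
    benchmark_version: str             # version label as published by the source
    source_model_name: str             # model name exactly as the source reports it
    metric: str                        # e.g. "source-defined rating"
    score: float                       # higher is better for this metric
    run_details: Optional[str] = None  # run details, when the source provides them


def sort_within_benchmark(rows: list[ScoreRow]) -> list[ScoreRow]:
    """Order rows by score, refusing to mix benchmarks or metrics."""
    scopes = {(r.benchmark, r.metric) for r in rows}
    if len(scopes) > 1:
        raise ValueError(f"rows span multiple benchmarks or metrics: {sorted(scopes)}")
    return sorted(rows, key=lambda r: r.score, reverse=True)


# Placeholder rows: invented names and values, not real LMArena data.
rows = [
    ScoreRow("LMArena", "example-version", "model-a", "source-defined rating", 1200.0),
    ScoreRow("LMArena", "example-version", "model-b", "source-defined rating", 1250.0),
]
ordered = sort_within_benchmark(rows)
print([r.source_model_name for r in ordered])
```

Raising an error on mixed input, rather than silently sorting, is one way to keep a rating from being read outside the benchmark that defines it.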

What to be careful about

LMArena presents its results as a ranking, which conflicts with the non-ranking tone of evals.report. Its ratings are included here only with careful framing.

No composite ranking
evals.report never combines benchmarks. The source-defined rating on LMArena is its own number; don’t average it with other metrics.