evals.report

Search Arena

A crowdsourced human-preference leaderboard from LMArena that ranks search-augmented LLMs via blind pairwise votes on grounded, web-search answers, reported as Bradley-Terry Elo-scale ratings.

Chat preference | Elo | Higher is better

What this benchmark measures

Rows on this page are sourced from public benchmark artifacts, leaderboard exports, or source-linked model reports. Each row keeps benchmark version, source model name, and available run details attached to the score.

The metric shown here is Elo. It should be interpreted within Search Arena, not compared as part of a site-wide ranking.
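As a rough illustration of how blind pairwise votes can become Bradley-Terry ratings on an Elo-like scale, here is a minimal sketch. It is not LMArena's actual pipeline: the vote data, the model names, the 400-point scale, the 1000-point baseline, and the choice to ignore ties are all assumptions made for this example.

```python
import math
from collections import defaultdict

def fit_bradley_terry(votes, iters=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Uses the classic minorization-maximization update. Assumes every
    model has at least one win and that ties are excluded.
    """
    models = sorted({m for pair in votes for m in pair})
    wins = defaultdict(int)
    pair_counts = defaultdict(int)
    for winner, loser in votes:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1

    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            denom = 0.0
            for j in models:
                if i == j:
                    continue
                n_ij = pair_counts.get(frozenset((i, j)), 0)
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            updated[i] = wins[i] / denom
        # Strengths are only defined up to a constant factor, so pin the
        # geometric mean to 1 after every pass.
        log_mean = sum(math.log(s) for s in updated.values()) / len(updated)
        strength = {m: s / math.exp(log_mean) for m, s in updated.items()}
    return strength

def to_elo_scale(strength, scale=400.0, base=1000.0):
    """Map strengths to an Elo-like scale (scale and base are assumed)."""
    return {m: base + scale * math.log10(s) for m, s in strength.items()}

# Hypothetical votes: each tuple is (winner, loser) for one blind comparison.
votes = (
    [("model-a", "model-b")] * 7 + [("model-b", "model-a")] * 3
    + [("model-b", "model-c")] * 6 + [("model-c", "model-b")] * 4
    + [("model-a", "model-c")] * 8 + [("model-c", "model-a")] * 2
)
ratings = to_elo_scale(fit_bradley_terry(votes))
for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.0f}")
```

Because the strengths are normalized to a geometric mean of 1, the resulting ratings average to the chosen baseline. Only the gaps between ratings carry information; the absolute level is a convention.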

No composite ranking
evals.report never combines benchmarks. Elo on Search Arena is its own number; don’t average it with other metrics.
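One way to see why an Elo-scale number does not mix with other metrics: under the usual Elo convention, only the gap between two ratings from the same pool has meaning. The sketch below assumes the standard 400-point logistic curve; the function name and the example ratings are hypothetical, and Search Arena's exact scaling may differ.

```python
def expected_win_probability(rating_a, rating_b, scale=400.0):
    """Probability that A is preferred over B under the logistic Elo model.

    The 400-point scale is the common Elo convention, assumed here.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))

# A 100-point gap implies the same preference rate wherever it sits on the scale.
p = expected_win_probability(1100, 1000)
p_shifted = expected_win_probability(2100, 2000)
print(f"{p:.3f} {p_shifted:.3f}")
```

Shifting every rating by the same constant leaves all predicted preference rates unchanged, so the absolute value of a rating is arbitrary. Averaging it with an accuracy percentage or a score from another leaderboard would combine numbers that share no common unit.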