evals.report

LMArena

A public human-preference signal, built from crowdsourced head-to-head chat comparisons, that users care about.

Later, HF dataset, Watchlist, No run guide, Page-backed data

Source detail

Score source

A public leaderboard and Hugging Face (HF) datasets exist, but the stability of the API and of the underlying data sources still needs verification.

Run guide

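Not a benchmark most users can run locally: ratings come from votes collected on the live Arena platform, not from a fixed test set that can be scored offline.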

How it can be used

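Treat Arena scores and Elo-style ratings as benchmark-specific metrics only. They describe relative standing among the models rated on Arena and are not comparable with scores from other benchmarks.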
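
To illustrate why an Elo-style rating is only meaningful relative to other ratings on the same scale, here is a minimal sketch using the classic Elo logistic curve with a 400-point scale. This is an assumption for illustration: the source does not describe how LMArena fits its ratings, the function name `expected_win_probability` is hypothetical, and the ratings are made-up values, not leaderboard data.

```python
def expected_win_probability(rating_a: float, rating_b: float,
                             scale: float = 400.0) -> float:
    """Chance that model A is preferred over model B under the classic
    Elo logistic curve (assumed here; not confirmed as LMArena's method)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Hypothetical ratings, not real leaderboard values.
p_high = expected_win_probability(1300.0, 1200.0)
p_low = expected_win_probability(300.0, 200.0)

print(f"{p_high:.3f}")  # prints 0.640
print(f"{p_low:.3f}")   # prints 0.640
```

Both pairs give the same probability because only the 100-point gap enters the formula. Under this assumed model the absolute number carries no meaning on its own, which is why it should not be read alongside accuracy-style scores from other benchmarks.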

Caveat

LMArena's ranking-first presentation conflicts with the tone of evals.report, so include it only with careful framing.
