Chat preference
LMArena
A public human-preference signal that users care about.
Later | HF dataset | Watchlist | No run guide | Page-backed data
Source detail
Score source
A public leaderboard and Hugging Face (HF) datasets exist, but the stability of the API and the underlying data source still needs verification.
Run guide
Not a benchmark that most users can run locally in the usual way.
How it can be used
Treat Arena score/Elo-style ratings as benchmark-specific metrics only.
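To illustrate why such ratings are benchmark-specific, here is a minimal sketch of the textbook Elo expected-score curve. It assumes the conventional base-10 logistic with a scale of 400; LMArena's actual fitting procedure may differ, and the function name and example ratings are hypothetical. The point is that a rating gap only translates into an expected head-to-head preference rate within the same pool of votes, not into an absolute capability score.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard Elo logistic curve.

    Assumption: base-10 logistic with scale 400 (the textbook Elo convention).
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


if __name__ == "__main__":
    # Hypothetical ratings for illustration only, not real leaderboard values.
    model_a, model_b = 1250.0, 1200.0
    p = elo_expected_score(model_a, model_b)
    print(f"Expected preference rate for A over B: {p:.3f}")
```

Only the difference between two ratings matters here, so the numbers carry no meaning outside the comparison pool that produced them.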
Caveat
Ranking-native UX conflicts with evals.report tone. Include only with careful framing.