How evals.report labels scores
Every score keeps its source, status, run context, and caveats. Here is what each label means, and why we never average benchmarks or crown a single “best model”.
Score status labels
Why scores are never averaged
evals.report never combines benchmarks into a composite or a leaderboard rank. A model’s percent-resolved rate on a coding benchmark and its accuracy on a reasoning benchmark are different measurements on different scales, so averaging them produces a number with no meaning. Each score is shown only within its own benchmark.
Why unofficial runs stay separate
Official leaderboard scores, vendor-reported launch numbers, and independent community runs are different kinds of evidence. We keep them distinct and labeled: community runs are shown nested under the official row, each with its delta versus the official score, so the gap itself is visible rather than hidden inside one averaged figure.
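As an illustration only, the nesting described above could be sketched as follows. This is not evals.report's actual code: the `Run` and `OfficialRow` types, the `source_kind` values, and the `nest_community_runs` helper are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    model: str
    benchmark: str
    source_kind: str  # hypothetical labels: "official", "vendor", "community"
    score: float


@dataclass
class OfficialRow:
    run: Run
    # Each entry is (community run, delta versus the official score).
    nested: list = field(default_factory=list)


def nest_community_runs(runs):
    """Nest community runs under the official row for the same model and benchmark.

    Scores are never merged or averaged: every run keeps its own value, and the
    only derived number is the per-run delta against the official score.
    """
    rows = {}
    for r in runs:
        if r.source_kind == "official":
            rows[(r.model, r.benchmark)] = OfficialRow(run=r)

    standalone = []  # vendor runs, and community runs with no official row
    for r in runs:
        if r.source_kind == "official":
            continue
        row = rows.get((r.model, r.benchmark))
        if r.source_kind == "community" and row is not None:
            delta = round(r.score - row.run.score, 2)
            row.nested.append((r, delta))
        else:
            standalone.append(r)
    return rows, standalone
```

In this sketch a community run scoring 47.5 against an official 50.0 is stored alongside a delta of -2.5, while the vendor-reported number stays in its own separately labeled row.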
Why doesn’t evals.report average scores or rank models?
Different benchmarks measure different things on different scales, and the same model can score differently under different harnesses. Averaging them invents a number that means nothing. evals.report shows each score within its own benchmark and never produces a composite ranking or a single “best model”.
Why are community and vendor runs kept separate from official scores?
An official leaderboard number, a lab’s self-report, and someone’s independent run are not the same evidence. Merging them would hide where a number came from. We keep each labeled with its source and run context so the differences — which are often the most interesting part — stay visible.
What context does every score row keep?
Model name, source URL, source label, status, evaluation date, benchmark version, reasoning effort, run config, and notes/caveats — plus a retrieval timestamp and content hash for fetched sources.
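For readers who want to mirror this structure in their own tooling, a minimal sketch of such a row might look like the following. The field names, the `ScoreRow` type, and the `fingerprint` helper are assumptions made for illustration, and SHA-256 is an assumed choice: the hash algorithm evals.report uses is not stated here.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ScoreRow:
    # Context kept with every score (field names are hypothetical).
    model_name: str
    source_url: str
    source_label: str
    status: str
    evaluation_date: str
    benchmark_version: str
    reasoning_effort: Optional[str]
    run_config: Optional[str]
    notes: str
    # Only populated for sources that were fetched.
    retrieved_at: Optional[str] = None
    content_hash: Optional[str] = None


def fingerprint(body: bytes) -> tuple[str, str]:
    """Return (UTC retrieval timestamp, hex digest) for a fetched source body.

    SHA-256 is an assumption for this sketch, not a documented choice.
    """
    retrieved_at = datetime.now(timezone.utc).isoformat()
    return retrieved_at, hashlib.sha256(body).hexdigest()
```

Storing the hash next to the retrieval timestamp makes it possible to tell later whether a source page changed after the score was recorded.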
How do I submit a score or correction?
Reach out on X to @inductive_ml or evals.report’s account with the benchmark, model, score, and a link to the source (a leaderboard, model card, paper, or your own run write-up with setup details). Community runs are welcome as long as the run context is included.