evals.report

How evals.report labels scores

Every score keeps its source, status, run context, and caveats. Here is what each label means, and why we never average benchmarks or crown a single “best model”.

Score status labels

Official

Taken from the benchmark’s own public leaderboard or a standardized independent evaluation (e.g. Epoch AI, Scale, the benchmark authors). This is the canonical number for a model on that benchmark.

Verified

Reported by the model’s own lab (in a launch announcement, model card, or system card) but not yet reproduced on a standardized leaderboard. This is often the only number available the day a model ships, so vendor-reported launch scores live here.

Unverified

A third-party or aggregator figure that mirrors a standardized eval but isn’t the benchmark’s own leaderboard or a first-party report. Useful signal, lower confidence.

Community

An independent, personal reproduction: someone ran the benchmark themselves and published their setup. Community runs are shown nested under the model’s official row with the delta versus official, and are never merged or averaged. Each carries its full run context: harness, reasoning effort, token budget, and caveats.
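The four labels above can be sketched as a small enumeration. This is an illustration only: `ScoreStatus`, `SOURCE_KIND_TO_STATUS`, `status_for`, and the source-kind strings are hypothetical names chosen for this example, not evals.report’s actual schema.

```python
from enum import Enum


class ScoreStatus(Enum):
    """The four status labels described above (hypothetical representation)."""

    OFFICIAL = "official"      # benchmark's own leaderboard or a standardized independent eval
    VERIFIED = "verified"      # reported by the model's own lab
    UNVERIFIED = "unverified"  # third-party or aggregator figure
    COMMUNITY = "community"    # independent, personal reproduction


# Assumed source kinds; the real site may categorize sources differently.
SOURCE_KIND_TO_STATUS = {
    "benchmark_leaderboard": ScoreStatus.OFFICIAL,
    "standardized_independent_eval": ScoreStatus.OFFICIAL,
    "launch_announcement": ScoreStatus.VERIFIED,
    "model_card": ScoreStatus.VERIFIED,
    "system_card": ScoreStatus.VERIFIED,
    "aggregator": ScoreStatus.UNVERIFIED,
    "personal_reproduction": ScoreStatus.COMMUNITY,
}


def status_for(source_kind: str) -> ScoreStatus:
    """Return the status label for a source kind; reject unknown kinds loudly."""
    try:
        return SOURCE_KIND_TO_STATUS[source_kind]
    except KeyError:
        raise ValueError(f"unknown source kind: {source_kind!r}") from None
```

Raising on an unknown source kind, rather than defaulting to a label, keeps a score from silently acquiring a status it has not earned.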

Why scores are never averaged

evals.report never combines benchmarks into a composite or a leaderboard rank. A model’s percentage resolved on a coding benchmark and its accuracy on a reasoning benchmark are different measurements on different scales, so averaging them produces a number with no meaning. Each score is shown only within its own benchmark.
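As a rough sketch of that rule, scores can be grouped per benchmark with no step that mixes benchmarks. The benchmark names, model names, and numbers below are made up for illustration, and `scores_by_benchmark` is a hypothetical helper, not the site’s code.

```python
from collections import defaultdict

# Hypothetical rows: (benchmark, model, score). All values are invented.
ROWS = [
    ("coding-bench", "model-a", 61.0),     # percentage resolved
    ("coding-bench", "model-b", 58.5),
    ("reasoning-bench", "model-a", 0.82),  # accuracy on a 0-1 scale
    ("reasoning-bench", "model-b", 0.88),
]


def scores_by_benchmark(rows):
    """Group scores per benchmark; deliberately computes no cross-benchmark aggregate."""
    grouped = defaultdict(list)
    for benchmark, model, score in rows:
        grouped[benchmark].append((model, score))
    return dict(grouped)


if __name__ == "__main__":
    for benchmark, entries in scores_by_benchmark(ROWS).items():
        print(benchmark, entries)
```

Note that the two invented benchmarks use different scales (0 to 100 versus 0 to 1), which is exactly why a mean across them would not describe anything.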

Why unofficial runs stay separate

Official leaderboard scores, vendor-reported launch numbers, and independent community runs are different kinds of evidence. We keep them distinct and labeled — community runs are shown nested under the official row with the delta versus official — so the gap itself is visible rather than hidden inside one averaged figure.
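The nesting described above might look like the following minimal sketch. `OfficialRow`, `CommunityRun`, and their fields are assumptions made for this example (the run-context fields follow the list given for community runs: harness, reasoning effort, token budget, caveats); the real data model may differ.

```python
from dataclasses import dataclass, field


@dataclass
class CommunityRun:
    """One independent reproduction, with its run context (hypothetical fields)."""

    score: float
    harness: str
    reasoning_effort: str
    token_budget: int
    caveats: str = ""


@dataclass
class OfficialRow:
    """An official score with community runs nested beneath it, never merged in."""

    model: str
    benchmark: str
    score: float
    community_runs: list[CommunityRun] = field(default_factory=list)

    def deltas(self) -> list[tuple[CommunityRun, float]]:
        """Pair each community run with its delta versus the official score."""
        return [
            (run, round(run.score - self.score, 2))
            for run in self.community_runs
        ]
```

The official score is never recomputed from the community runs; each delta is derived on demand from two separately stored numbers, so the gap stays visible.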

Why doesn’t evals.report average scores or rank models?

Different benchmarks measure different things on different scales, and the same model can score differently under different harnesses. Averaging them invents a number that means nothing. evals.report shows each score within its own benchmark and never produces a composite ranking or a single “best model”.

Why are community and vendor runs kept separate from official scores?

An official leaderboard number, a lab’s self-report, and someone’s independent run are not the same evidence. Merging them would hide where a number came from. We keep each labeled with its source and run context so the differences — which are often the most interesting part — stay visible.

What context does every score row keep?

Model name, source URL, source label, status, evaluation date, benchmark version, reasoning effort, run config, and notes/caveats — plus a retrieval timestamp and content hash for fetched sources.
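The fields listed above map naturally onto a record type. The sketch below is one possible shape, not the site’s actual schema: the field names and types, the ISO 8601 timestamp, and the choice of SHA-256 for the content hash are all assumptions.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ScoreRow:
    """Context kept with every score (field names are hypothetical)."""

    model_name: str
    source_url: str
    source_label: str
    status: str
    evaluation_date: str
    benchmark_version: str
    reasoning_effort: str
    run_config: str
    notes: str
    # Present only for fetched sources.
    retrieved_at: Optional[str] = None
    content_hash: Optional[str] = None


def fingerprint(fetched_bytes: bytes) -> tuple[str, str]:
    """Return (UTC retrieval timestamp, SHA-256 hex digest) for a fetched source.

    SHA-256 is an assumption here; the source does not say which hash is used.
    """
    retrieved_at = datetime.now(timezone.utc).isoformat()
    content_hash = hashlib.sha256(fetched_bytes).hexdigest()
    return retrieved_at, content_hash
```

Storing a hash of the fetched page alongside the timestamp makes it possible to tell later whether a source changed after the score was recorded.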

How do I submit a score or correction?

Reach out on X to @inductive_ml or evals.report’s account with the benchmark, model, score, and a link to the source (a leaderboard, model card, paper, or your own run write-up with setup details). Community runs are welcome as long as the run context is included.