Question 1

Why doesn’t evals.report average scores or rank models?

Accepted Answer

Different benchmarks measure different things on different scales, and the same model can score differently under different harnesses. Averaging them invents a number that means nothing. evals.report shows each score within its own benchmark and never produces a composite ranking or a single “best model”.
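
A minimal sketch of why a cross-benchmark average is misleading. The model names, benchmark names, and numbers below are invented for illustration and are not taken from evals.report. One hypothetical benchmark reports a percentage and the other reports rating points, so the naive mean is dominated by whichever scale has the larger numbers.

```python
# Hypothetical scores, invented purely for illustration.
# "accuracy_pct" is on a 0-100 scale; "rating_points" is on a scale in the thousands.
scores = {
    "model_a": {"accuracy_pct": 90.0, "rating_points": 1200.0},
    "model_b": {"accuracy_pct": 60.0, "rating_points": 1250.0},
}


def naive_mean(row):
    """Average raw scores across benchmarks, ignoring their scales."""
    return sum(row.values()) / len(row)


def per_benchmark_view(all_scores):
    """Regroup scores by benchmark so each value stays on its own scale."""
    view = {}
    for model, row in all_scores.items():
        for benchmark, value in row.items():
            view.setdefault(benchmark, {})[model] = value
    return view


means = {model: naive_mean(row) for model, row in scores.items()}
print(means)                       # model_b comes out ahead on the composite
print(per_benchmark_view(scores))  # each benchmark shown separately
```

In this made-up case the composite puts model_b ahead (655.0 versus 645.0) even though it trails by 30 points on the percentage benchmark, because a 50-point gap on the larger scale outweighs it. The per-benchmark view keeps both facts visible.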

Question 2

Why are community and vendor runs kept separate from official scores?

Accepted Answer

An official leaderboard number, a lab’s self-report, and someone’s independent run are not the same evidence. Merging them would hide where a number came from. We keep each labeled with its source and run context so the differences — which are often the most interesting part — stay visible.

Question 3

What context does every score row keep?

Accepted Answer

Every score row keeps the model name, source URL, source label, status, evaluation date, benchmark version, reasoning effort, run config, and notes/caveats. Rows built from fetched sources also keep a retrieval timestamp and a content hash.
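
As a rough sketch, the fields listed above could be modeled as a single record. The class name, field names, types, and the choice of SHA-256 for the content hash are assumptions made for illustration; the answer above names the fields but not how evals.report stores them.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass, field, replace
from datetime import date, datetime, timezone
from typing import Any, Optional


@dataclass(frozen=True)
class ScoreRow:
    """Hypothetical record holding the context kept with every score."""
    model_name: str
    source_url: str
    source_label: str          # who published the number
    status: str
    evaluation_date: Optional[date]
    benchmark_version: str
    reasoning_effort: Optional[str]
    run_config: dict[str, Any] = field(default_factory=dict)
    notes: str = ""
    # Only set for fetched sources.
    retrieved_at: Optional[datetime] = None
    content_hash: Optional[str] = None


def with_fetch_metadata(row: ScoreRow, body: bytes,
                        fetched_at: Optional[datetime] = None) -> ScoreRow:
    """Return a copy of the row with retrieval timestamp and content hash set.

    SHA-256 is an assumed hash choice, not one confirmed by the source.
    """
    return replace(
        row,
        retrieved_at=fetched_at or datetime.now(timezone.utc),
        content_hash=hashlib.sha256(body).hexdigest(),
    )


# Placeholder values only; none of these describe a real score.
row = ScoreRow(
    model_name="example-model",
    source_url="https://example.com/leaderboard",
    source_label="community",
    status="unverified",
    evaluation_date=None,
    benchmark_version="v1",
    reasoning_effort=None,
)
fetched = with_fetch_metadata(row, b"<html>example page</html>")
```

Making the record immutable and returning a copy means the row as first entered and the row with fetch metadata can be compared directly.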

Question 4

How do I submit a score or correction?

Accepted Answer

Reach out on X to @inductive_ml or evals.report’s account with the benchmark, model, score, and a link to the source (a leaderboard, model card, paper, or your own run write-up with setup details). Community runs are welcome as long as the run context is included.
