OpenAI simple-evals
Useful evaluator reference for GPQA, SimpleQA, HealthBench, BrowseComp, MMLU, MATH, MGSM, and DROP.
Watchlist, Evaluator reference, Review needed, Run guide ready, Public data
Source detail
Score source
Good evaluator code; mixed value as an official cross-lab score source.
Run guide
High value for simple local/API runs.
How it can be used
Use as a run guide and evaluator reference for source-linked lab tables.
Caveat
This is not a neutral cross-lab official leaderboard.