GPQA Diamond
Widely cited graduate-level science QA benchmark already in product scope.
Next | Manual curated | Review needed | Run guide ready | Curated source
Source detail
Score source
Benchmark data and baselines are public, but there is no single canonical cross-lab leaderboard.
Run guide
The benchmark repo and the simple-evals GPQA runner can be used to run the evaluation.
How it can be used
Use curated, source-linked rows taken from model and system cards and from lab release tables.
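One way to picture a curated, source-linked row is as a small record that cannot exist without a link back to where the score was published. The sketch below is illustrative only: the `ScoreRow` class, its field names, and the `comparable` helper are hypothetical and are not taken from any existing schema, and the values in the usage note are placeholders, not real results.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScoreRow:
    """Hypothetical shape for one curated GPQA Diamond score entry."""
    model: str
    score_pct: float              # accuracy as a percentage, 0 to 100
    source_url: str               # link to the card or table the score came from
    source_type: str              # e.g. "system_card" or "release_table"
    cot: Optional[bool] = None    # chain-of-thought used? None if the source does not say
    n_shots: Optional[int] = None # few-shot count; None if the source does not say

    def __post_init__(self):
        # A row without a source link is not "source-linked", so reject it.
        if not self.source_url.startswith(("http://", "https://")):
            raise ValueError("every row needs a source link")
        if not 0.0 <= self.score_pct <= 100.0:
            raise ValueError("score_pct must be a percentage in [0, 100]")


def comparable(a: ScoreRow, b: ScoreRow) -> bool:
    """Treat two rows as comparable only when both state the same eval settings."""
    return (
        a.cot is not None and a.cot == b.cot
        and a.n_shots is not None and a.n_shots == b.n_shots
    )
```

For example, `ScoreRow("example-model", 50.0, "https://example.com/card", "system_card", cot=True, n_shots=0)` builds a placeholder row, while passing an empty `source_url` raises `ValueError`. Keeping the evaluation settings on the row makes it possible to flag pairs of scores that should not be compared side by side.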
Caveat
Prompting, answer extraction, and chain-of-thought policy can materially change scores.
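To make the answer-extraction part of this caveat concrete, the sketch below scores the same four model outputs with two different extraction rules. Both regular expressions, the sample outputs, and the gold labels are made up for illustration; they are not the patterns used by the benchmark repo or by simple-evals.

```python
import re

# Strict rule (hypothetical): the answer must sit on its own line as "Answer: X"
# with an uppercase letter.
STRICT = re.compile(r"^Answer:\s*([A-D])\s*$", re.MULTILINE)

# Lenient rule (hypothetical): case-insensitive, accepts "answer is (C)" and
# similar phrasings.
LENIENT = re.compile(r"(?i)answer\s*(?:is|:)?\s*\(?([A-D])\b")


def extract(pattern, text):
    """Return the last matched choice letter in uppercase, or None if no match."""
    matches = pattern.findall(text)
    return matches[-1].upper() if matches else None


def accuracy(pattern, outputs, gold):
    """Fraction of outputs whose extracted letter equals the gold letter."""
    correct = sum(extract(pattern, out) == g for out, g in zip(outputs, gold))
    return correct / len(gold)


# Made-up model outputs and gold labels, for illustration only.
outputs = [
    "Reasoning...\nAnswer: B",
    "I think the answer is (C).",
    "Answer: a",
    "Not sure.",
]
gold = ["B", "C", "A", "D"]

strict_acc = accuracy(STRICT, outputs, gold)    # 0.25: only the first output parses
lenient_acc = accuracy(LENIENT, outputs, gold)  # 0.75: the first three outputs parse
```

The model outputs are identical in both runs; only the parsing rule differs, yet the measured accuracy moves from 0.25 to 0.75 on this toy set. This is one reason a reported score is hard to interpret without the extraction rule and prompt that produced it.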