evals.report

GPQA Diamond

A difficult subset of GPQA for evaluating graduate-level science question answering.

Reasoning | accuracy | Higher is better

Known official sources: 1

Next | Manual curated | Review needed | Run guide ready | Curated source


Widely cited graduate-level science QA benchmark already in product scope.

Category: Reasoning
Owner: GPQA authors
Data path: Use curated source-linked rows from model-system cards and lab release tables.
Known caveat: Prompt format, answer extraction, and subset selection can materially change reported accuracy.
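
To make the answer-extraction caveat concrete, here is a minimal Python sketch. It is not the GPQA authors' scoring code or any lab's harness. The responses, gold labels, and the two extraction rules (`extract_strict`, `extract_lenient`) are all hypothetical, invented only to show how the same model outputs can yield different accuracy depending on how the answer letter is parsed.

```python
import re

# Hypothetical model outputs and gold labels, invented for illustration only.
RESPONSES = [
    "Answer: B",
    "The correct choice is (C).",
    "I think it's A. Answer: D",
    "answer: a",
]
GOLD = ["B", "C", "D", "A"]


def extract_strict(text):
    """Accept only a response that is exactly 'Answer: X' with X in A-D."""
    match = re.fullmatch(r"Answer: ([ABCD])", text.strip())
    return match.group(1) if match else None


def extract_lenient(text):
    """Take the last standalone letter A-D anywhere in the response, ignoring case."""
    matches = re.findall(r"\b([A-D])\b", text.upper())
    return matches[-1] if matches else None


def accuracy(extractor, responses=RESPONSES, gold=GOLD):
    """Fraction of responses whose extracted letter equals the gold label."""
    correct = sum(extractor(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)


if __name__ == "__main__":
    print("strict :", accuracy(extract_strict))
    print("lenient:", accuracy(extract_lenient))
```

On these four invented responses the strict rule scores 0.25 and the lenient rule scores 1.0, even though the underlying outputs are identical. Differences in prompt format and subset selection are not modeled here, so two reported numbers are only comparable if the source states the extraction rule, prompt, and subset used.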