evals.report

GPQA Diamond

A difficult subset of GPQA for evaluating graduate-level science question answering.

Reasoning | accuracy | Higher is better

What this benchmark measures

GPQA Diamond focuses on difficult graduate-level science questions. Evaluations typically report accuracy on a fixed subset with a specified prompt format and answer extraction method.
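As a rough illustration of how an accuracy figure depends on the answer extraction step, here is a minimal sketch. It assumes multiple-choice answers labeled A to D and a response that ends with a line such as "Answer: C". The regular expression, the function names, and the sample responses are hypothetical and are not taken from any particular evaluation harness.

```python
import re
from typing import Optional

# Hypothetical extraction rule: accept only an explicit "Answer: X" marker,
# where X is one of A-D, optionally wrapped in parentheses.
ANSWER_RE = re.compile(r"Answer:\s*\(?([A-D])\)?(?![A-Za-z])")


def extract_choice(response: str) -> Optional[str]:
    """Return the last explicitly marked choice, or None if none is found."""
    matches = ANSWER_RE.findall(response)
    return matches[-1] if matches else None


def accuracy(responses: list[str], gold: list[str]) -> float:
    """Fraction of responses whose extracted choice equals the gold label.

    A response with no extractable answer counts as incorrect.
    """
    if len(responses) != len(gold):
        raise ValueError("responses and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(extract_choice(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)


# Made-up example data, for illustration only.
sample_responses = [
    "The decay is forbidden by parity. Answer: C",
    "Answer: (B)",
    "I think it's probably D.",   # no explicit marker, so nothing is extracted
    "Answer: A",
]
sample_gold = ["C", "B", "D", "B"]
sample_accuracy = accuracy(sample_responses, sample_gold)
```

Under this strict rule the third response is scored as wrong even though it names the gold letter. A more lenient extractor would score it as correct, which is one way the same model outputs can yield different reported accuracy.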

Rows on this page come from public model-system reports rather than a single central GPQA leaderboard. Treat the prompt format, answer extraction, and reasoning policy as part of the score context.
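One way to keep the score context attached to each number is to store it alongside the accuracy and check it before comparing two rows. The sketch below is an assumption about how such a record could look. The class names, field names, and example values are hypothetical and do not describe how evals.report stores its data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreContext:
    """Evaluation settings that travel with a reported score."""
    prompt_format: str
    answer_extraction: str
    reasoning_policy: str


@dataclass(frozen=True)
class ReportedScore:
    model: str
    accuracy: float
    context: ScoreContext


def context_differences(a: ReportedScore, b: ReportedScore) -> list[str]:
    """List the context fields on which two reported scores disagree."""
    names = ("prompt_format", "answer_extraction", "reasoning_policy")
    return [n for n in names if getattr(a.context, n) != getattr(b.context, n)]


# Made-up rows with placeholder model names and accuracy values.
row_a = ReportedScore(
    model="model-a",
    accuracy=0.60,
    context=ScoreContext("zero-shot", "last 'Answer: X' line", "reasoning allowed"),
)
row_b = ReportedScore(
    model="model-b",
    accuracy=0.65,
    context=ScoreContext("zero-shot", "last 'Answer: X' line", "no reasoning"),
)
differing = context_differences(row_a, row_b)
```

If `differing` is non-empty, the gap between the two accuracy values may reflect the evaluation setup as much as the models themselves.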

The metric shown here is benchmark-local accuracy. It should be compared within GPQA Diamond only.

What to be careful about

Prompt format, answer extraction, and subset selection can materially change reported accuracy.

No composite ranking
evals.report never combines benchmarks. Accuracy on GPQA Diamond is its own number; don’t average it with other metrics.