Question 1

What is GPQA Diamond?

Accepted Answer

GPQA Diamond is a difficult subset of GPQA, a benchmark of graduate-level science questions. It is a reasoning benchmark, and results are reported as accuracy.

Question 2

What does accuracy mean on GPQA Diamond?

Accepted Answer

Accuracy on GPQA Diamond is the percentage of questions answered correctly; higher is better. Scores are shown only within GPQA Diamond and are never averaged with other benchmarks.
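
As a rough illustration of the metric, the sketch below computes accuracy as a percentage from a list of predicted answers and an answer key. The function name `accuracy_percent` and the four-question sample are hypothetical and are not taken from evals.report or from any particular evaluation harness.

```python
def accuracy_percent(predictions, answers):
    """Return the percentage of predictions that match the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot compute accuracy on an empty question set")
    # Count exact matches between each prediction and its reference answer.
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


# Made-up example: three of the four predicted letters match the key.
score = accuracy_percent(["A", "C", "B", "D"], ["A", "C", "D", "D"])
print(f"{score:.1f}%")  # 75.0%
```

A real harness also has to extract the chosen option from the model's output before comparing, which is one reason scores depend on the setup.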

Question 3

What is the top reported GPQA Diamond score?

Accepted Answer

Fugu has the top reported score on GPQA Diamond: 95.5% (accuracy).

Question 4

Why do GPQA Diamond scores differ across runs?

Accepted Answer

The evaluation harness, scaffold, reasoning effort, and prompt setup all affect results, so two runs of the same model can produce different scores. evals.report keeps each score with its run context so the differences stay visible.
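
One way to picture "a score with its run context" is a record that stores the setup next to the number. The sketch below is an assumption for illustration only: the `RunResult` fields, the `group_by_model` helper, and every value in the sample data are made up and do not describe how evals.report actually stores results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """Hypothetical record: one score plus the context it was produced in."""
    model: str
    benchmark: str
    accuracy_pct: float
    harness: str
    scaffold: str
    reasoning_effort: str
    prompt_setup: str


def group_by_model(runs):
    """Group runs by model name without merging or averaging their scores."""
    grouped = {}
    for run in runs:
        grouped.setdefault(run.model, []).append(run)
    return grouped


# Made-up runs of the same placeholder model under different reasoning effort.
runs = [
    RunResult("model-x", "GPQA Diamond", 71.2, "harness-a", "none", "low", "zero-shot"),
    RunResult("model-x", "GPQA Diamond", 78.8, "harness-a", "none", "high", "zero-shot"),
]

for model, model_runs in group_by_model(runs).items():
    for r in model_runs:
        print(f"{model}: {r.accuracy_pct}% (effort={r.reasoning_effort}, harness={r.harness})")
```

Keeping the runs as separate records, rather than collapsing them into one number per model, is what lets a reader see that the two scores came from different setups.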

Question 5

Does evals.report rank models across benchmarks?

Accepted Answer

No. GPQA Diamond scores are shown within their own metric; evals.report never combines benchmarks into a composite ranking or a single "best model".
