What is GPQA Diamond?

GPQA Diamond is a difficult subset of GPQA used to evaluate graduate-level science question answering. It is a reasoning benchmark measured by accuracy.

What does accuracy mean on GPQA Diamond?

GPQA Diamond reports accuracy (%); higher is better. Scores are shown only within GPQA Diamond and are never averaged with other benchmarks.

What is the top reported GPQA Diamond score?

Fugu has the top reported score on GPQA Diamond: 95.5% (accuracy).

Why do GPQA Diamond scores differ across runs?

Harness, scaffold, reasoning effort, and prompt setup change results, so two runs of the same model can differ. evals.report keeps each score with its run context so the differences stay visible.

Does evals.report rank models across benchmarks?

No. GPQA Diamond scores are shown within their own metric; evals.report never combines benchmarks into a composite ranking or a single "best model".


GPQA Diamond


GPQA Diamond is a 198-question graduate-level science multiple-choice set; the score is exact-match accuracy on the A/B/C/D answer. The idavidrein/gpqa repo hosts only the (HF-gated) data and paper scaffolding, so the canonical runnable harness is OpenAI simple-evals: its gpqa_eval.py pulls gpqa_diamond.csv from OpenAI's public Azure mirror and grades answers locally. Keep the following attached to any score: the harness (openai/simple-evals at a pinned commit), the sampler/model, the --n-repeats value (default 10 via simple_evals.py), whether --examples subsampled the set, and that this is the Diamond variant.

Benchmark: GPQA Diamond
Repository: github.com/openai/simple-evals
Dataset: huggingface.co/datasets/Idavidrein/gpqa
Metric: accuracy

1. Install

shell

git clone https://github.com/openai/simple-evals.git

shell

# simple-evals has no requirements.txt/setup.py; README installs deps individually. openai/anthropic are from the README; pandas+blobfile (Azure CSV fetch) and jinja2 (HTML report) are required by gpqa_eval.py/common.py.
pip install openai anthropic blobfile pandas jinja2

shell

export OPENAI_API_KEY=sk-...   # or ANTHROPIC_API_KEY=... for Anthropic models

2. Run evaluation

shell

# List the samplers/models registered in the repo (README-documented).
python -m simple-evals.simple_evals --list-models

shell

# Run only GPQA (Diamond) against a registered model. Add --examples N to subsample, --debug for a 1-repeat smoke test, --n-repeats to override the default of 10.
python -m simple-evals.simple_evals --model <model_name> --eval gpqa

3. Expected output

gpqa_eval.py downloads gpqa_diamond.csv from https://openaipublic.blob.core.windows.net/simple-evals/gpqa_diamond.csv (the variant defaults to 'diamond'), queries your model on the 198 Diamond questions (simple_evals.py sets n_repeats=10 unless --n-repeats or --debug is given), regex-extracts the chosen letter, and computes exact-match accuracy. simple_evals.py prints the aggregate score and writes a per-eval HTML report and JSON results to the paths it prints (under /tmp in the upstream script at the time of writing).
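The grading step can be illustrated with a minimal sketch. It is not the harness's own code: the regex, the function names, and the choice to take the last "Answer:" match are assumptions made for illustration, and gpqa_eval.py may extract letters differently.

```python
import re
from typing import Optional

# Hypothetical pattern for illustration only; the harness's own regex may differ.
# It looks for "Answer:" followed by a single letter A-D, optionally in parentheses.
ANSWER_PATTERN = re.compile(r"(?i)answer\s*:\s*\(?([A-D])\b")


def extract_choice(response: str) -> Optional[str]:
    """Return the letter from the last 'Answer: X' in the response, or None."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1].upper() if matches else None


def exact_match_accuracy(responses: list[str], gold: list[str]) -> float:
    """Fraction of responses whose extracted letter equals the gold letter."""
    if not gold or len(responses) != len(gold):
        raise ValueError("responses and gold must be non-empty and equal in length")
    # A response with no extractable letter yields None and counts as wrong.
    correct = sum(extract_choice(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)
```

Because an unparseable response counts as incorrect in this sketch, a model that answers correctly in an unexpected format still loses the point, which is one reason prompt template and harness belong in the run context.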

4. Submit results

GPQA has no central submission/leaderboard; report the accuracy your run prints. Always attach the run context: harness = openai/simple-evals at a pinned git commit, the exact --model / sampler used, --n-repeats (simple_evals.py default 10) and whether --examples subsampled the set, and that the variant is Diamond (gpqa_diamond.csv). Do not compare against numbers produced by a different harness, prompt template, or repeat count.
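One way to keep that run context attached to a score is a small record saved alongside the result. The sketch below is an assumption-laden illustration: the RunContext class, its field names, and the comparable helper are hypothetical and not part of simple-evals, and treating "same pinned commit" as a proxy for "same prompt template" is a simplification.

```python
import json
from dataclasses import asdict, dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class RunContext:
    """Metadata to store next to a reported accuracy (field names are hypothetical)."""
    harness: str             # e.g. "openai/simple-evals"
    commit: str              # pinned git commit of the harness
    model: str               # value passed to --model
    n_repeats: int           # value of --n-repeats (CLI default is 10)
    examples: Optional[int]  # value of --examples, or None for the full set
    variant: str = "diamond"


def to_json(ctx: RunContext) -> str:
    """Serialize the context so it can be stored with the score."""
    return json.dumps(asdict(ctx), sort_keys=True)


def comparable(a: RunContext, b: RunContext) -> bool:
    """True when two runs differ at most in the model being evaluated."""
    return replace(a, model="") == replace(b, model="")
```

With this shape, two scores are placed side by side only when comparable returns True; a run with a different repeat count or a subsampled set is kept separate.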

Gotchas

A prior/competing guide referenced scripts evaluate_gpqa.py and score_gpqa.py in idavidrein/gpqa — these DO NOT EXIST. idavidrein/gpqa is data + paper scaffolding, not a model runner. Use openai/simple-evals.

simple-evals must be run as a module from the PARENT directory of the clone (python -m simple-evals.simple_evals), and the cloned folder must be named simple-evals (the README invocation assumes this). There is no requirements.txt/setup.py/pyproject.toml — install deps individually as shown.

simple-evals does NOT read the gated HF dataset; it pulls gpqa_diamond.csv from OpenAI's public Azure mirror. If you instead want to load the canonical HF dataset (Idavidrein/gpqa, CC BY 4.0) directly, note that it is gated: you must accept its terms and agree not to post examples online.

Note the two different n_repeats defaults: the GPQAEval class __init__ defaults to 4, but simple_evals.py constructs it with (args.n_repeats or 10), so a normal CLI run defaults to 10 repeats (about 1,980 model calls for the full 198-question Diamond set, which is expensive). Use --examples and/or --n-repeats to control cost, and report whatever you set, since it changes the variance of the accuracy estimate.
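The cost and variance trade-off can be estimated before launching a run. The helpers below are a rough sketch with hypothetical names; the 198 figure is the size of the Diamond set, and the standard-error formula treats every graded answer as an independent draw, which is a simplification because repeats of the same question are correlated.

```python
import math
from typing import Optional


def model_calls(n_questions: int, n_repeats: int, examples: Optional[int] = None) -> int:
    """Number of model queries: questions actually used times repeats per question."""
    # Capping at n_questions is a choice made in this sketch, not harness behavior.
    used = n_questions if examples is None else min(examples, n_questions)
    return used * n_repeats


def binomial_standard_error(accuracy: float, n_samples: int) -> float:
    """Standard error of an accuracy estimate under an independence assumption."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


full_run = model_calls(198, 10)             # 1980 calls at the CLI default
smoke_test = model_calls(198, 1)            # 198 calls with a single repeat
subsample = model_calls(198, 10, examples=50)  # 500 calls
```

Fewer questions or repeats lowers cost but widens the uncertainty around the reported accuracy, so the settings used should travel with the score.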

--list-models only lists already-registered samplers; it does not wire up a new model. Using a custom model means adding/adjusting a sampler under sampler/ that talks to your API before --model will accept it.