How to run GPQA Diamond — benchmark guide

Run guidesReasoning

GPQA Diamond is a 448-question graduate-level science multiple-choice set; the score is exact-match accuracy on the A/B/C/D answer. The idavidrein/gpqa repo only hosts the (HF-gated) data and paper scaffolding, so the canonical runnable harness is OpenAI simple-evals: its gpqa_eval.py pulls gpqa_diamond.csv from OpenAI's public Azure mirror and grades answers locally. Keep attached to any score: harness=openai/simple-evals @ commit, sampler/model, --n-repeats value (default 10 via simple_evals.py), whether --examples subsampled, and that this is the Diamond variant.

Benchmark

GPQA Diamond

Repository

github.com/openai/simple-evals

Dataset

huggingface.co/datasets/Idavidrein/gpqa

Metric

accuracy

1Install

shell

git clone https://github.com/openai/simple-evals.git

shell

# simple-evals has no requirements.txt/setup.py; README installs deps individually. openai/anthropic are from the README; pandas+blobfile (Azure CSV fetch) and jinja2 (HTML report) are required by gpqa_eval.py/common.py.
pip install openai anthropic blobfile pandas jinja2

shell

export OPENAI_API_KEY=sk-...   # or ANTHROPIC_API_KEY=... for Anthropic models

2Run evaluation

shell

# List the samplers/models registered in the repo (README-documented).
python -m simple-evals.simple_evals --list-models

shell

# Run only GPQA (Diamond) against a registered model. Add --examples N to subsample, --debug for a 1-repeat smoke test, --n-repeats to override the default of 10.
python -m simple-evals.simple_evals --model <model_name> --eval gpqa

3Expected output

gpqa_eval.py downloads gpqa_diamond.csv from https://openaipublic.blob.core.windows.net/simple-evals/gpqa_diamond.csv (variant defaults to 'diamond'), queries your model (simple_evals.py sets n_repeats=10 unless --n-repeats/--debug given, over the 448 Diamond questions), regex-extracts the chosen letter, and computes exact-match accuracy. simple_evals.py prints the aggregate score and writes a per-eval HTML report + JSON results into the working directory.

4Submit results

GPQA has no central submission/leaderboard; report the accuracy your run prints. Always attach the run context: harness = openai/simple-evals at a pinned git commit, the exact --model / sampler used, --n-repeats (simple_evals.py default 10) and whether --examples subsampled the set, and that the variant is Diamond (gpqa_diamond.csv). Do not compare against numbers produced by a different harness, prompt template, or repeat count.

Gotchas

A prior/competing guide referenced scripts evaluate_gpqa.py and score_gpqa.py in idavidrein/gpqa — these DO NOT EXIST. idavidrein/gpqa is data + paper scaffolding, not a model runner. Use openai/simple-evals.

simple-evals must be run as a module from the PARENT directory of the clone (python -m simple-evals.simple_evals), and the cloned folder must be named simple-evals (the README invocation assumes this). There is no requirements.txt/setup.py/pyproject.toml — install deps individually as shown.

simple-evals does NOT read the gated HF dataset; it pulls gpqa_diamond.csv from OpenAI's public Azure mirror. The canonical HF dataset (Idavidrein/gpqa, CC BY 4.0) is gated — you must accept terms and agree not to post examples online — if you instead want to load it directly.

Note the two different n_repeats defaults: the GPQAEval class __init__ defaults to 4, but simple_evals.py constructs it as (args.n_repeats or 10), so a normal CLI run defaults to 10 repeats (~4480 model calls for full Diamond — expensive). Use --examples and/or --n-repeats to control cost and report whatever you set, since it changes accuracy variance.

--list-models only lists already-registered samplers; it does not wire up a new model. Using a custom model means adding/adjusting a sampler under sampler/ that talks to your API before --model will accept it.