Run GPQA Diamond
The same run guide is also available from the benchmark detail page.
Category: Reasoning. Metric: accuracy.
Evaluate multiple-choice science questions from the GPQA Diamond subset with a fixed prompt and answer extractor. Keep prompt format, answer extraction, and reasoning policy attached to any reported score.
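As a rough illustration of what a fixed prompt and answer extractor could look like, here is a minimal sketch. The function names, the `Answer: <letter>` convention, and the per-question seed are assumptions made for this example; they are not taken from the GPQA repository's scripts.

```python
import random
import re

LETTERS = "ABCD"

def build_prompt(question, correct, incorrect, seed):
    """Format one item as a four-option prompt; returns (prompt, gold_letter)."""
    options = [correct] + list(incorrect)
    # Seed per question so the option order is reproducible across runs.
    random.Random(seed).shuffle(options)
    gold = LETTERS[options.index(correct)]
    lines = [f"Question: {question}", ""]
    lines += [f"{letter}) {text}" for letter, text in zip(LETTERS, options)]
    lines += ["", 'Finish with a line of the form "Answer: <letter>".']
    return "\n".join(lines), gold

# Hypothetical extraction rule: a letter A-D right after "Answer:",
# optionally in parentheses. The \b keeps "Answer: Because" from matching.
ANSWER_RE = re.compile(r"Answer:\s*\(?([ABCD])\b", re.IGNORECASE)

def extract_answer(completion):
    """Return the last stated answer letter, or None if none is found."""
    matches = ANSWER_RE.findall(completion)
    return matches[-1].upper() if matches else None
```

Whatever format is chosen, keeping the template and the regular expression in version control alongside the results makes the reported score reproducible.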
1. Install
    git clone https://github.com/idavidrein/gpqa.git
    cd gpqa
    python -m venv .venv
    source .venv/bin/activate
    pip install datasets pandas
2. Run evaluation
    python evaluate_gpqa.py --subset diamond --model your-model-id --output ./gpqa-diamond.jsonl
3. Score output
    python score_gpqa.py --input ./gpqa-diamond.jsonl
4. Expected output
The run produces a per-question answer file and an accuracy score that is local to this benchmark. Do not average or otherwise combine this accuracy with metrics from unrelated benchmarks.
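For a sense of how accuracy could be computed from a per-question file, here is a minimal sketch. It assumes one JSON object per line with hypothetical `predicted` and `gold` fields; the actual schema of the answer file may differ, so adjust the keys accordingly.

```python
import json

def accuracy_from_jsonl(path):
    """Return (accuracy, n_questions) for a per-question answer file."""
    correct = total = 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            total += 1
            # A missing or unparseable prediction (None) counts as wrong.
            correct += record.get("predicted") == record["gold"]
    if total == 0:
        raise ValueError(f"no records found in {path}")
    return correct / total, total
```

Counting unparsed answers as wrong, rather than dropping them, keeps the denominator equal to the number of questions in the subset.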
5. Submit results
Report the exact subset, prompt format, answer extraction method, and whether hidden reasoning was used.
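One way to keep these details attached to a score is to store them in a small record next to the answer file. The sketch below is only an assumption about what such a record might contain; the `RunReport` class and its field names are hypothetical and not part of any submission format.

```python
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class RunReport:
    """Hypothetical record of the settings behind one reported score."""
    benchmark: str
    subset: str
    model: str
    prompt_format: str
    answer_extraction: str
    hidden_reasoning: bool
    accuracy: float
    n_questions: int

def to_json(report):
    """Serialize the report with stable key order for easy diffing."""
    return json.dumps(asdict(report), indent=2, sort_keys=True)

# Placeholder values only; fill accuracy and n_questions from the scoring step.
example = RunReport(
    benchmark="GPQA",
    subset="diamond",
    model="your-model-id",
    prompt_format="zero-shot, four lettered options",
    answer_extraction="regex on final 'Answer: <letter>' line",
    hidden_reasoning=False,
    accuracy=0.0,
    n_questions=0,
)
```

Calling `to_json(example)` yields a JSON document that can be submitted or archived together with the per-question file.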
Gotchas
Prompt format and answer extraction can change results.
Use the Diamond subset and the same scoring script for every run you compare.
Report whether chain-of-thought or hidden reasoning is used, if applicable.
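To see why answer extraction alone can move the score, consider the following sketch, which applies two extraction rules to the same completions. Both rules and the three completions are made up for illustration; neither rule is claimed to be the one used by the scoring script.

```python
import re

def strict_extract(completion):
    """Accept only an explicit 'Answer: X' line."""
    match = re.search(r"^Answer:\s*([ABCD])\s*$", completion, re.MULTILINE)
    return match.group(1) if match else None

def lenient_extract(completion):
    """Fall back to the last standalone capital letter A-D in the text."""
    letters = re.findall(r"\b([ABCD])\b", completion)
    return letters[-1] if letters else None

def accuracy(extract, completions, gold):
    """Fraction of completions whose extracted letter matches the gold letter."""
    hits = sum(extract(c) == g for c, g in zip(completions, gold))
    return hits / len(gold)

# Made-up completions, for illustration only.
completions = [
    "Option B fits the data.\nAnswer: B",
    "The answer is (C).",
    "A is tempting, but D is correct.",
]
gold = ["B", "C", "D"]

strict_score = accuracy(strict_extract, completions, gold)
lenient_score = accuracy(lenient_extract, completions, gold)
```

On these three completions the strict rule credits one answer out of three and the lenient rule credits all three, even though the model outputs are identical. This is why the extraction method belongs in the report.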