evals.report

GPQA Diamond

A difficult subset of GPQA for graduate-level science question answering evaluation.

Category: Reasoning. Metric: accuracy. Higher is better.

Evaluate multiple-choice science questions from the GPQA Diamond subset with a fixed prompt and answer extractor. Keep prompt format, answer extraction, and reasoning policy attached to any reported score.
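As an illustration of what "a fixed prompt and answer extractor" can mean in practice, here is a minimal Python sketch. The template wording, the `Answer: X` convention, and the names `PROMPT_TEMPLATE`, `build_prompt`, and `extract_answer` are assumptions made for this example; they are not taken from the GPQA repository or from any official harness.

```python
import re

# Hypothetical fixed template; the benchmark itself does not mandate this wording.
PROMPT_TEMPLATE = (
    "Answer the following multiple-choice question. "
    "End your response with a line of the form 'Answer: X', "
    "where X is one of A, B, C, or D.\n\n"
    "{question}\n\n"
    "A) {a}\nB) {b}\nC) {c}\nD) {d}\n"
)

# Matches 'Answer: C', 'answer: (c)', 'Answer:B)' but not 'Answer: Apple'.
ANSWER_RE = re.compile(r"Answer:\s*\(?([ABCD])\)?(?![A-Za-z])", re.IGNORECASE)


def build_prompt(question, choices):
    """Render one question and exactly four choices with the fixed template."""
    if len(choices) != 4:
        raise ValueError("expected exactly four answer choices")
    a, b, c, d = choices
    return PROMPT_TEMPLATE.format(question=question, a=a, b=b, c=c, d=d)


def extract_answer(response):
    """Return the letter from the last 'Answer: X' match, or None if absent."""
    matches = ANSWER_RE.findall(response)
    return matches[-1].upper() if matches else None
```

Taking the last match rather than the first is a design choice: a model that reasons out loud may state a tentative answer before its final one. Returning `None` instead of guessing keeps unparseable responses visible at scoring time.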

Benchmark: GPQA Diamond
Dataset: huggingface.co/datasets/Idavidrein/gpqa
Metric: accuracy

1. Install

shell
git clone https://github.com/idavidrein/gpqa.git
cd gpqa
python -m venv .venv
source .venv/bin/activate
pip install datasets pandas

2. Run evaluation

shell
python evaluate_gpqa.py --subset diamond --model your-model-id --output ./gpqa-diamond.jsonl
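One step an evaluation run typically has to perform is turning each dataset row into a lettered multiple-choice item. The Python sketch below shows one way to do that with a per-question seeded shuffle, so the answer order is reproducible across runs. The column names (`Question`, `Correct Answer`, `Incorrect Answer 1` to `Incorrect Answer 3`) are assumed from the published GPQA data layout and should be checked against the dataset you load; `make_item` and its seeding scheme are hypothetical and do not describe how `evaluate_gpqa.py` works internally.

```python
import random

LETTERS = "ABCD"


def make_item(row, index, seed=0):
    """Build a lettered item from one row, shuffling choices deterministically.

    `row` is assumed to be a dict with the column names used below.
    """
    options = [
        row["Correct Answer"],      # position 0 is always the correct option
        row["Incorrect Answer 1"],
        row["Incorrect Answer 2"],
        row["Incorrect Answer 3"],
    ]
    # Derive a per-question seed so every run sees the same ordering.
    rng = random.Random(seed * 1_000_003 + index)
    order = [0, 1, 2, 3]
    rng.shuffle(order)
    return {
        "question": row["Question"],
        "choices": [options[i] for i in order],
        # The correct letter is wherever original position 0 landed.
        "correct": LETTERS[order.index(0)],
    }
```

Shuffling indices rather than the answer strings keeps the correct letter unambiguous even if two options happen to have identical text. Whatever seed is used belongs in the reported run configuration, since choice order is part of the prompt format.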

3. Score output

shell
python score_gpqa.py --input ./gpqa-diamond.jsonl
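For readers who want to see what scoring amounts to, the following Python sketch computes accuracy from a JSONL answer file. It assumes each line is a JSON object with a `predicted` field (a letter, or `null` when no answer could be extracted) and a `correct` field. Those field names and the `score_jsonl` function are hypothetical, not the documented format of `score_gpqa.py`.

```python
import json
import math


def score_jsonl(path):
    """Compute accuracy over a JSONL file of per-question records.

    Unparsed predictions (None) count as incorrect and are tallied separately.
    """
    total = correct = unparsed = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            total += 1
            predicted = record.get("predicted")
            if predicted is None:
                unparsed += 1
            elif predicted == record["correct"]:
                correct += 1
    if total == 0:
        raise ValueError("no records found in " + path)
    accuracy = correct / total
    # Binomial standard error; a rough indication of sampling noise.
    stderr = math.sqrt(accuracy * (1 - accuracy) / total)
    return {"n": total, "accuracy": accuracy, "stderr": stderr, "unparsed": unparsed}
```

Counting unparsed responses as wrong, and reporting how many there were, makes it visible when a low score is caused by answer extraction rather than by the model's answers. The standard error matters because the subset is small, so modest accuracy differences between runs can fall within noise.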

4. Expected output

The run produces a per-question answer file and an accuracy score that is specific to this benchmark. Do not average or otherwise combine this score with results from unrelated benchmarks.

5. Submit results

Report the exact subset, prompt format, answer extraction method, and whether hidden reasoning was used.
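One way to keep these details attached to a score is to store them in a single record next to the result. The Python sketch below is an illustration only: the `RunReport` class and its field names are assumptions for this example, not a submission format defined by evals.report or by the GPQA authors.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunReport:
    """Hypothetical record bundling a score with the settings that produced it."""
    benchmark: str
    subset: str
    model: str
    prompt_format: str       # e.g. "zero-shot, fixed template"
    answer_extraction: str   # e.g. "regex on final 'Answer: X' line"
    reasoning_policy: str    # e.g. "visible chain-of-thought", "hidden reasoning", "none"
    accuracy: float
    n_questions: int


def to_json(report):
    """Serialize the report with stable key order so reports diff cleanly."""
    return json.dumps(asdict(report), indent=2, sort_keys=True)
```

A record might be built as `RunReport(benchmark="GPQA Diamond", subset="diamond", model="your-model-id", ...)` and written alongside the answer file. Making the class frozen prevents the settings from being edited after the score is computed.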

Gotchas

Prompting format and answer extraction can change results.
Use the exact Diamond subset and the same scoring script for every run you compare.
Report whether chain-of-thought or hidden reasoning is used, if applicable.