What does accuracy mean on SimpleQA Verified?

SimpleQA Verified reports accuracy (%); higher is better. Scores are shown only within SimpleQA Verified and are never averaged with other benchmarks.

What is the top reported SimpleQA Verified score?

Gemini 3.1 Pro Preview has the top reported score on SimpleQA Verified: 77.3% (accuracy).

Why do SimpleQA Verified scores differ across runs?

Harness, scaffold, reasoning effort, and prompt setup change results, so two runs of the same model can differ. evals.report keeps each score with its run context so the differences stay visible.

Does evals.report rank models across benchmarks?

No. SimpleQA Verified scores are shown within their own metric; evals.report never combines benchmarks into a composite ranking or a single "best model".

SimpleQA Verified

A factual short-answer QA benchmark measuring parametric knowledge and hallucination resistance (SimpleQA Verified, as tracked by Epoch AI).

SimpleQA Verified is a curated 1,000-question version of OpenAI's SimpleQA from Google DeepMind and Google Research. It measures short-form parametric factuality with no tools. The canonical locally runnable harness is UK AISI's Inspect Evals, whose `inspect_evals/simpleqa_verified` task runs your model on all 1,000 prompts and grades each response as correct, incorrect, or not attempted; the paper-faithful setup uses a GPT-4.1 autorater as the grader. When reporting a score, state the scorer (tool-calling default or paper-faithful `-T scorer=original`), the grader model, the metric (Epoch uses plain proportion correct; the paper uses an F1 harmonic mean), the Inspect Evals task version, and the dataset revision.

Benchmark: SimpleQA Verified

Repository: github.com/UKGovernmentBEIS/inspect_evals

Dataset: huggingface.co/datasets/codelion/SimpleQA-Verified

Metric: accuracy

1. Install

```shell
pip install inspect-evals
# Placeholders are quoted so the angle brackets are not parsed as shell redirections.
export OPENAI_API_KEY="<your-openai-key>"   # grader uses an OpenAI GPT-4.1 model
export ANTHROPIC_API_KEY="<your-key>"       # only if evaluating an Anthropic model
```

2. Run evaluation

```shell
# Quick smoke test (default tool-calling scorer; evaluated model also grades if no grader role is bound)
inspect eval inspect_evals/simpleqa_verified --model openai/gpt-4o-mini --limit 10
```

```shell
# Full paper-faithful reproduction run (string-matching scorer + GPT-4.1 grader)
# The --generate-config path is relative to the root of an inspect_evals repository checkout.
inspect eval inspect_evals/simpleqa_verified \
  -T scorer=original \
  --generate-config src/inspect_evals/simpleqa/paper_config/simpleqa_verified.yaml \
  --model-role 'grader={"model": "openai/gpt-4.1-2025-04-14", "temperature": 1.0}' \
  --model openai/gpt-4o-mini
```

3. Score output

```shell
# Scoring is folded into the run; view results/logs with:
inspect view
```

4. Expected output

An Inspect .eval log (viewable via `inspect view`) reporting proportion correct, incorrect, not-attempted, accuracy-given-attempted, and the F-score (harmonic mean of correct and correct-given-attempted) over the 1,000 prompts. The headline 'accuracy' is the proportion correct. Do NOT compare an Inspect F-score against Epoch's leaderboard number directly: Epoch reports plain proportion-correct while the paper/Google report the F1 harmonic mean, so the two metrics differ; always state which one you used.
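To make the gap between the two metrics concrete, here is a minimal sketch that derives both from the three grade counts. The function name `simpleqa_metrics` and the counts in the example call are hypothetical, chosen only for illustration; they are not real results, and this is not the harness's own scoring code.

```shell
#!/usr/bin/env bash
# Sketch: compute proportion correct, correct-given-attempted, and their
# harmonic mean (the F-score) from grade counts.
# Usage: simpleqa_metrics <correct> <incorrect> <not_attempted>
simpleqa_metrics() {
    LC_ALL=C awk -v c="$1" -v i="$2" -v n="$3" 'BEGIN {
        total = c + i + n          # all graded prompts
        attempted = c + i          # prompts the model actually answered
        acc = (total > 0) ? c / total : 0           # proportion correct
        cga = (attempted > 0) ? c / attempted : 0   # correct given attempted
        f = (acc + cga > 0) ? 2 * acc * cga / (acc + cga) : 0   # harmonic mean
        printf "accuracy=%.3f correct_given_attempted=%.3f f_score=%.3f\n", acc, cga, f
    }'
}

# Hypothetical counts over 1,000 prompts (illustration only).
simpleqa_metrics 600 250 150
# prints: accuracy=0.600 correct_given_attempted=0.706 f_score=0.649
```

With these made-up counts the F-score sits almost five points above the proportion correct, which is why a score should always be labeled with the metric it uses.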

5. Submit results

There is no open submission API for local runs. The official leaderboard is Google DeepMind's Kaggle benchmark (https://www.kaggle.com/benchmarks/deepmind/simpleqa-verified) and Epoch AI maintains its own runs at https://epoch.ai/benchmarks/simple-qa-verified. For a self-reported score, attach: the Inspect Evals task version (e.g. changelog tag 3-B), the dataset HF revision (codelion/SimpleQA-Verified rev 5a913f57326d89935cbed0ac071494e7e624b876), the scorer used (tool-calling default vs `-T scorer=original`), the grader model (openai/gpt-4.1-2025-04-14, temp 1.0), generation settings (max-tokens), and which metric you report (proportion-correct vs F1).
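One way to keep those details together is to store them next to the score as a small JSON record. The sketch below is only an illustration: the helper name `write_run_record` and every field name are hypothetical, not a format that any leaderboard accepts. The values are the ones quoted on this page, with the scorer set to `original` and the README's full-run `--max-tokens` value assumed.

```shell
#!/usr/bin/env bash
# Sketch: print a self-report record as JSON.
# Field names are hypothetical; values are the ones quoted in the text.
# Usage: write_run_record <metric_name> <metric_value>
write_run_record() {
    local metric_name="$1"    # e.g. "proportion_correct" or "f1"
    local metric_value="$2"   # your measured score, as a fraction
    cat <<EOF
{
  "benchmark": "SimpleQA Verified",
  "inspect_evals_task_version": "3-B",
  "dataset": "codelion/SimpleQA-Verified",
  "dataset_revision": "5a913f57326d89935cbed0ac071494e7e624b876",
  "scorer": "original",
  "grader_model": "openai/gpt-4.1-2025-04-14",
  "grader_temperature": 1.0,
  "max_tokens": 128000,
  "metric": "${metric_name}",
  "value": ${metric_value}
}
EOF
}

# Example call with a placeholder value (not a real result).
write_run_record "f1" 0.0
```

Adjust the scorer, grader, and generation fields to match the run you actually performed.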

Gotchas

The Inspect harness pulls the dataset from the codelion/SimpleQA-Verified mirror (pinned to revision 5a913f57...), NOT from the official google/simpleqa-verified HF dataset; both are the same 1,000 rows but cite the mirror revision when reporting.

Two different metrics circulate: Epoch reports plain proportion-correct, while Google's paper and Kaggle leaderboard report the F1 harmonic mean of correct and correct-given-attempted. Mixing them produces a wrong comparison.

Two scorers exist: the default 'tool' (schema_tool_graded_scorer) and the paper-faithful 'original' string-matching scorer (`-T scorer=original`). They give different numbers; the paper-faithful reproduction requires `--generate-config .../paper_config/simpleqa_verified.yaml` plus the GPT-4.1 grader via `--model-role`. If no grader role is bound, the evaluated model grades itself.

This is a no-tools benchmark by design (search/retrieval makes it trivial). Do not enable web/tool access for the evaluated model. Note gpt-5-nano ignores temperature (defaults to 1), and the README sets --max-tokens 128000 for the full run since the paper does not specify a limit.