OpenAI simple-evals

Useful evaluator reference for GPQA, SimpleQA, HealthBench, BrowseComp, MMLU, MATH, MGSM, and DROP.

Watchlist, Evaluator reference, Review needed, Run guide ready, Public data

Source detail

Score source

Good evaluator code, but of mixed value as an official cross-lab score source.

Run guide

High value for simple local or API-based runs.

How it can be used

Use as a run guide and evaluator reference for source-linked lab tables.

Caveat

This is not a neutral, official cross-lab leaderboard.