evals.report

OpenAI simple-evals

Useful evaluator reference for GPQA, SimpleQA, HealthBench, BrowseComp, MMLU, MATH, MGSM, and DROP.

Watchlist, Evaluator reference, Review needed, Run guide ready, Public data
Official source

Source detail

Score source

The evaluator code is good, but its value as a cross-lab official score source is mixed.

Run guide

High value for simple local/API runs.

How it can be used

Use as a run guide and evaluator reference for source-linked lab tables.

Caveat

This is not a neutral cross-lab official leaderboard.
