evals.report

Humanity's Last Exam

A broad expert-level academic question-answering benchmark for frontier reasoning systems.

Category: Reasoning | Metric: accuracy | Higher is better

What this benchmark measures

Rows on this page are sourced from public benchmark artifacts, leaderboard exports, or source-linked model reports. Each row keeps benchmark version, source model name, and available run details attached to the score.

The metric shown here is accuracy. Interpret it within Humanity's Last Exam only; it is not part of a site-wide ranking.

What to be careful about

Avoid scraped tables that carry no retrieved-at metadata. Without a retrieval date there is no way to tell whether a score is stale.
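
As a rough illustration of that check, the sketch below drops rows that have no retrieval timestamp or whose timestamp is older than a chosen cutoff. The field names (`retrieved_at`, `model`, `accuracy`), the 90-day default, and the sample rows are all assumptions made up for this example; they are not evals.report's actual schema or data.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical rows: field names and values are placeholders, not real scores.
SAMPLE_ROWS = [
    {"model": "example-model-a", "accuracy": 0.10,
     "retrieved_at": "2025-01-10T00:00:00+00:00"},
    {"model": "example-model-b", "accuracy": 0.12},  # no retrieved-at metadata
    {"model": "example-model-c", "accuracy": 0.08,
     "retrieved_at": "2024-01-10T00:00:00+00:00"},   # old retrieval date
]


def usable_rows(rows, now, max_age_days=90):
    """Keep rows whose retrieved-at timestamp exists and is recent enough."""
    cutoff = now - timedelta(days=max_age_days)
    kept = []
    for row in rows:
        stamp = row.get("retrieved_at")
        if not stamp:
            # No retrieval date: staleness cannot be judged, so skip the row.
            continue
        if datetime.fromisoformat(stamp) >= cutoff:
            kept.append(row)
    return kept


if __name__ == "__main__":
    now = datetime(2025, 2, 1, tzinfo=timezone.utc)
    for row in usable_rows(SAMPLE_ROWS, now):
        print(row["model"], row["accuracy"], row["retrieved_at"])
```

The cutoff is a judgment call. A stricter or looser `max_age_days` may suit how often the underlying leaderboard changes.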

No composite ranking
evals.report never combines benchmarks. Accuracy on Humanity's Last Exam is its own number; don't average it with other metrics.