BenchmarksReasoning
Humanity's Last Exam
A broad expert-level academic question-answering benchmark for frontier reasoning systems.
ReasoningaccuracyHigher is better
Dataset/eval access is public enough to document, but official run details vary.
1Expected output
Use the official source links for current output format, submission steps, and benchmark-specific result files.
2Submit results
Keep source URL, source model name, benchmark version, harness, and run context attached to any reported score.
Gotchas
Avoid stale scraped tables without retrieved-at metadata.
Do not mix this benchmark's metric with unrelated benchmark metrics.