What is Humanity's Last Exam?

A broad expert-level academic question-answering benchmark for frontier reasoning systems. It is a reasoning benchmark measured by accuracy.

What does accuracy mean on Humanity's Last Exam?

Humanity's Last Exam reports accuracy (%); higher is better. Scores are shown only within Humanity's Last Exam and are never averaged with other benchmarks.

What is the top reported Humanity's Last Exam score?

Claude Mythos Preview has the top reported score on Humanity's Last Exam: 56.8% (accuracy).

Why do Humanity's Last Exam scores differ across runs?

Harness, scaffold, reasoning effort, and prompt setup change results, so two runs of the same model can differ. evals.report keeps each score with its run context so the differences stay visible.

Does evals.report rank models across benchmarks?

No. Humanity's Last Exam scores are shown within their own metric; evals.report never combines benchmarks into a composite ranking or a single "best model".

BenchmarksReasoning

Humanity's Last Exam

A broad expert-level academic question-answering benchmark for frontier reasoning systems.

ReasoningaccuracyHigher is better

Scores About Run this benchmark

Humanity's Last Exam (HLE) is run with the official centerforaisafety/hle harness: load the gated cais/hle test set (2,500 questions, text + image), generate predictions against your own OpenAI-compatible model endpoint with run_model_predictions.py, then grade them with run_judge_results.py, which uses an LLM judge (default o3-mini-2025-01-31) to emit accuracy and calibration error. The headline metric is accuracy (with a 95% Wald CI). Keep attached to any score: target model name, judge model, max_completion_tokens, temperature setting, whether the full 2,500-question test split was used, and the harness commit.

Benchmark

Humanity's Last Exam

Repository

github.com/centerforaisafety/hle

Dataset

huggingface.co/datasets/cais/hle

Metric

accuracy

1Install

shell

git clone https://github.com/centerforaisafety/hle.git

shell

cd hle

shell

pip install -r requirements.txt

shell

export OPENAI_API_KEY=your_openai_key_here

shell

huggingface-cli login  # dataset cais/hle is gated; accept terms on the HF page first

2Run evaluation

shell

cd hle_eval

shell

MODEL="gpt-4o-2024-11-20"

shell

DATASET="cais/hle"

shell

python run_model_predictions.py --dataset ${DATASET} --model ${MODEL} --max_completion_tokens 8192 --num_workers 100

shell

# add --max_samples 3 to smoke-test before a full run

3Score output

shell

python run_judge_results.py --dataset ${DATASET} --predictions hle_${MODEL}.json --num_workers 100

shell

# override judge with --judge o3-mini-2025-01-31 (default) if desired

4Expected output

run_model_predictions.py writes predictions to hle_${MODEL}.json (e.g. hle_gpt-4o-2024-11-20.json) in hle_eval/, skipping already-answered questions on re-runs. run_judge_results.py writes judged_hle_${MODEL}.json and prints a '*** Metrics ***' block with 'Accuracy: X% +/- Y% | n = N' and 'Calibration Error: Z' (accuracy normalized over the full question count N). Accuracy is the headline metric; do not compare it against scores produced with a different judge model, token budget, or subset of questions.

5Submit results

There is no automated public submission endpoint in the harness; results are self-reported. Report accuracy with its 95% CI and calibration error, plus the run context: target model name, judge model (default o3-mini-2025-01-31), --max_completion_tokens, temperature setting, full vs. subset of the 2,500-question test split, and the harness commit. The official leaderboard at lastexam.ai tracks frontier-model results; contact agibenchmark@safe.ai for listing.

Gotchas

The cais/hle dataset is gated: you must accept terms on the HuggingFace page and run huggingface-cli login before load_dataset('cais/hle', split='test') will work.

Both scripts use AsyncOpenAI and read OPENAI_API_KEY from the environment (the README has no explicit export line). To evaluate a non-OpenAI model you must point the client at an OpenAI-compatible base_url; the judge step still calls OpenAI's o3-mini-2025-01-31 by default, so an OpenAI key is needed for grading regardless of the target model.

Do not set --max_completion_tokens below 8192 for reasoning models; the README warns this causes model collapse. The --temperature flag (default 0.0) exists but is commented out in the actual API call in run_model_predictions.py, so temperature is not currently transmitted to the model.

The dataset is multimodal (text + image); ensure your target model endpoint accepts image inputs or many questions will fail. --num_workers must be >= 2 (asserted in BOTH run_model_predictions.py and run_judge_results.py); its default is 10 for predictions and 100 for the judge, and the README example passes 100 explicitly to both.