How to run Humanity's Last Exam — benchmark guide

Run guidesReasoning

Humanity's Last Exam (HLE) is run with the official centerforaisafety/hle harness: load the gated cais/hle test set (2,500 questions, text + image), generate predictions against your own OpenAI-compatible model endpoint with run_model_predictions.py, then grade them with run_judge_results.py, which uses an LLM judge (default o3-mini-2025-01-31) to emit accuracy and calibration error. The headline metric is accuracy (with a 95% Wald CI). Keep attached to any score: target model name, judge model, max_completion_tokens, temperature setting, whether the full 2,500-question test split was used, and the harness commit.

Benchmark

Humanity's Last Exam

Repository

github.com/centerforaisafety/hle

Dataset

huggingface.co/datasets/cais/hle

Metric

accuracy

1Install

shell

git clone https://github.com/centerforaisafety/hle.git

shell

cd hle

shell

pip install -r requirements.txt

shell

export OPENAI_API_KEY=your_openai_key_here

shell

huggingface-cli login  # dataset cais/hle is gated; accept terms on the HF page first

2Run evaluation

shell

cd hle_eval

shell

MODEL="gpt-4o-2024-11-20"

shell

DATASET="cais/hle"

shell

python run_model_predictions.py --dataset ${DATASET} --model ${MODEL} --max_completion_tokens 8192 --num_workers 100

shell

# add --max_samples 3 to smoke-test before a full run

3Score output

shell

python run_judge_results.py --dataset ${DATASET} --predictions hle_${MODEL}.json --num_workers 100

shell

# override judge with --judge o3-mini-2025-01-31 (default) if desired

4Expected output

run_model_predictions.py writes predictions to hle_${MODEL}.json (e.g. hle_gpt-4o-2024-11-20.json) in hle_eval/, skipping already-answered questions on re-runs. run_judge_results.py writes judged_hle_${MODEL}.json and prints a '*** Metrics ***' block with 'Accuracy: X% +/- Y% | n = N' and 'Calibration Error: Z' (accuracy normalized over the full question count N). Accuracy is the headline metric; do not compare it against scores produced with a different judge model, token budget, or subset of questions.

5Submit results

There is no automated public submission endpoint in the harness; results are self-reported. Report accuracy with its 95% CI and calibration error, plus the run context: target model name, judge model (default o3-mini-2025-01-31), --max_completion_tokens, temperature setting, full vs. subset of the 2,500-question test split, and the harness commit. The official leaderboard at lastexam.ai tracks frontier-model results; contact agibenchmark@safe.ai for listing.

Gotchas

The cais/hle dataset is gated: you must accept terms on the HuggingFace page and run huggingface-cli login before load_dataset('cais/hle', split='test') will work.

Both scripts use AsyncOpenAI and read OPENAI_API_KEY from the environment (the README has no explicit export line). To evaluate a non-OpenAI model you must point the client at an OpenAI-compatible base_url; the judge step still calls OpenAI's o3-mini-2025-01-31 by default, so an OpenAI key is needed for grading regardless of the target model.

Do not set --max_completion_tokens below 8192 for reasoning models; the README warns this causes model collapse. The --temperature flag (default 0.0) exists but is commented out in the actual API call in run_model_predictions.py, so temperature is not currently transmitted to the model.

The dataset is multimodal (text + image); ensure your target model endpoint accepts image inputs or many questions will fail. --num_workers must be >= 2 (asserted in BOTH run_model_predictions.py and run_judge_results.py); its default is 10 for predictions and 100 for the judge, and the README example passes 100 explicitly to both.