evals.report

Humanity's Last Exam

A broad expert-level academic question-answering benchmark for frontier reasoning systems.

Category: Reasoning. Metric: accuracy (higher is better).
| Model | Lab | Score | Source model | Status | Date |
| --- | --- | --- | --- | --- | --- |
| Claude Opus 4.8 | Anthropic | 49.8% | Claude Opus 4.8 | Verified | May 28, 2026 |
| Gemini 3.1 Pro Preview | Google DeepMind | 45.9% | Gemini 3.1 Pro | Official | May 31, 2026 |
| GPT-5.5 | OpenAI | 43.56% | GPT-5.5 | Official | May 31, 2026 |
| Gemini 3.5 Flash | Google DeepMind | 42.5% | Gemini 3.5 Flash | Official | May 31, 2026 |
| GPT-5.4 | OpenAI | 40.28% | GPT-5.4 | Official | May 31, 2026 |
| Claude Opus 4.7 | Anthropic | 39.04% | Opus 4.7 | Official | May 31, 2026 |
| Gemini 3 Pro | Google DeepMind | 38.3% | Gemini 3 Pro | Official | May 31, 2026 |
| Gemini 3 Flash | Google DeepMind | 36.6% | Gemini 3 Flash | Official | May 31, 2026 |
| Claude Opus 4.6 | Anthropic | 34.2% | Opus 4.6 | Official | May 31, 2026 |
| Grok 4.3 | xAI | 33.12% | Grok 4.3 | Official | May 31, 2026 |
| DeepSeek V4 Pro | DeepSeek | 32.4% | DeepSeek 4 Pro | Official | May 31, 2026 |
| Grok 4.2 | xAI | 30.2% | Grok 4.2 | Official | May 31, 2026 |
| Kimi K2.6 | Moonshot AI | 29.9% | Kimi K2.6 | Official | May 31, 2026 |
| GPT-5.2 | OpenAI | 29.9% | GPT-5.2 | Official | May 31, 2026 |
| GPT-5.1 | OpenAI | 27.2% | GPT-5.1 | Official | May 31, 2026 |
| Claude Opus 4.5 | Anthropic | 25.8% | Opus 4.5 | Official | May 31, 2026 |
| GLM-5.1 | Z.ai | 25.63% | GLM 5.1 | Official | May 31, 2026 |
| GPT-5 high | OpenAI | 25.32% | GPT-5 | Official | May 31, 2026 |
| Grok 4 | xAI | 24.52% | Grok 4 | Official | May 31, 2026 |
| Gemini 2.5 Pro | Google DeepMind | 21.64% | Gemini 2.5 Pro | Official | May 31, 2026 |
| Claude Sonnet 4.6 | Anthropic | 21.07% | Sonnet 4.6 | Official | May 31, 2026 |

Each row reports the model's accuracy on Humanity's Last Exam.