Humanity's Last Exam
A broad expert-level academic question-answering benchmark for frontier reasoning systems.
Reasoning · accuracy · Higher is better
| Model | Lab | Score | Source model | Status | Date | |
|---|---|---|---|---|---|---|
| Claude Opus 4.8 | Anthropic | 49.8% | Claude Opus 4.8 | Verified | May 28, 2026 | Details |
| Gemini 3.1 Pro Preview | Google DeepMind | 45.9% | Gemini 3.1 Pro | Official | May 31, 2026 | Details |
| GPT-5.5 | OpenAI | 43.56% | GPT-5.5 | Official | May 31, 2026 | Details |
| Gemini 3.5 Flash | Google DeepMind | 42.5% | Gemini 3.5 Flash | Official | May 31, 2026 | Details |
| GPT-5.4 | OpenAI | 40.28% | GPT-5.4 | Official | May 31, 2026 | Details |
| Claude Opus 4.7 | Anthropic | 39.04% | Opus 4.7 | Official | May 31, 2026 | Details |
| Gemini 3 Pro | Google DeepMind | 38.3% | Gemini 3 Pro | Official | May 31, 2026 | Details |
| Gemini 3 Flash | Google DeepMind | 36.6% | Gemini 3 Flash | Official | May 31, 2026 | Details |
| Claude Opus 4.6 | Anthropic | 34.2% | Opus 4.6 | Official | May 31, 2026 | Details |
| Grok 4.3 | xAI | 33.12% | Grok 4.3 | Official | May 31, 2026 | Details |
| DeepSeek V4 Pro | DeepSeek | 32.4% | DeepSeek 4 Pro | Official | May 31, 2026 | Details |
| Grok 4.2 | xAI | 30.2% | Grok 4.2 | Official | May 31, 2026 | Details |
| Kimi K2.6 | Moonshot AI | 29.9% | Kimi K2.6 | Official | May 31, 2026 | Details |
| GPT-5.2 | OpenAI | 29.9% | GPT-5.2 | Official | May 31, 2026 | Details |
| GPT-5.1 | OpenAI | 27.2% | GPT-5.1 | Official | May 31, 2026 | Details |
| Claude Opus 4.5 | Anthropic | 25.8% | Opus 4.5 | Official | May 31, 2026 | Details |
| GLM-5.1 | Z.ai | 25.63% | GLM 5.1 | Official | May 31, 2026 | Details |
| GPT-5 high | OpenAI | 25.32% | GPT-5 | Official | May 31, 2026 | Details |
| Grok 4 | xAI | 24.52% | Grok 4 | Official | May 31, 2026 | Details |
| Gemini 2.5 Pro | Google DeepMind | 21.64% | Gemini 2.5 Pro | Official | May 31, 2026 | Details |
| Claude Sonnet 4.6 | Anthropic | 21.07% | Sonnet 4.6 | Official | May 31, 2026 | Details |
Each row reports the model's accuracy on Humanity's Last Exam.