GPQA Diamond
A difficult subset of GPQA for graduate-level science question answering evaluation.
Category: Reasoning. Metric: accuracy (higher is better).
| Model | Lab | Score | Source model | Status | Date |
|---|---|---|---|---|---|
| GPT-5.4 Pro | OpenAI | 94.6% | GPT-5.4 Pro | Official | May 30, 2026 |
| Gemini 3.1 Pro Preview | Google DeepMind | 94.1% | Gemini 3.1 Pro | Official | May 30, 2026 |
| GPT-5.5 | OpenAI | 94.0% | GPT-5.5 | Official | May 30, 2026 |
| GPT-5.5 Pro | OpenAI | 93.9% | GPT-5.5 Pro | Official | May 30, 2026 |
| Claude Opus 4.8 | Anthropic | 93.6% | Claude Opus 4.8 | Verified | May 28, 2026 |
| GPT-5.4 xHigh | OpenAI | 93.3% | GPT-5.4 | Official | May 30, 2026 |
| Gemini 3.5 Flash | Google DeepMind | 92.8% | Gemini 3.5 Flash | Official | May 30, 2026 |
| Gemini 3 Pro | Google DeepMind | 92.6% | Gemini 3 Pro | Official | May 30, 2026 |
| GPT-5.2 | OpenAI | 91.4% | GPT-5.2 | Official | May 30, 2026 |
| Kimi K2.6 | Moonshot AI | 90.8% | Kimi K2.6 | Official | May 30, 2026 |
| Claude Opus 4.6 | Anthropic | 90.5% | Claude Opus 4.6 | Official | May 30, 2026 |
| Claude Opus 4.7 | Anthropic | 90.2% | Claude Opus 4.7 | Official | May 30, 2026 |
| Muse Spark | Meta | 89.8% | Muse Spark | Official | May 30, 2026 |
| Qwen 3.6 Max Preview | Alibaba / Qwen | 89.1% | Qwen 3.6 Max (Preview) | Official | May 30, 2026 |
| GLM-5 | Z.ai | 87.8% | GLM-5 | Official | May 30, 2026 |
| GPT-5.1 | OpenAI | 87.6% | GPT-5.1 | Official | May 30, 2026 |
| Kimi K2.5 | Moonshot AI | 87.6% | Kimi K2.5 | Official | May 30, 2026 |
| Qwen 3.6 Plus | Alibaba / Qwen | 87.4% | Qwen 3.6 Plus | Official | May 30, 2026 |
| Claude Sonnet 4.6 | Anthropic | 87.4% | Claude Sonnet 4.6 | Official | May 30, 2026 |
| Grok 4 | xAI | 87.0% | Grok 4 | Official | May 30, 2026 |
| GPT-5 high | OpenAI | 86.2% | GPT-5 | Official | May 30, 2026 |
| Claude Opus 4.5 | Anthropic | 86.0% | Claude Opus 4.5 | Official | May 30, 2026 |
| GLM-5.1 | Z.ai | 85.5% | GLM-5.1 | Official | May 30, 2026 |
| Gemini 2.5 Pro | Google DeepMind | 85.3% | Gemini 2.5 Pro (Jun 2025) | Official | May 30, 2026 |
Each row reports the model’s accuracy on GPQA Diamond.
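As a rough illustration of what an accuracy percentage like those in the table means, the sketch below scores a set of multiple-choice answers against reference answers. The page does not describe how any listed run was actually graded, so the exact-match comparison, the function name `accuracy_percent`, and the toy data are all assumptions made for illustration only.

```python
def accuracy_percent(predictions, answers):
    """Return the percentage of predictions that exactly match the references.

    Hypothetical helper: exact-match grading is an assumption, not the
    documented scoring method of any run listed on this page.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    # Count positions where the predicted choice equals the reference choice.
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


# Toy data (not real benchmark items): four questions, three answered correctly.
toy_predictions = ["A", "C", "B", "D"]
toy_answers = ["A", "C", "B", "A"]
toy_score = accuracy_percent(toy_predictions, toy_answers)
print(f"{toy_score:.1f}%")  # prints "75.0%"
```

Under this reading, a score in the table is the share of benchmark questions a model answered correctly, expressed as a percentage. Details such as prompting, sampling, and answer extraction can differ between runs and are not captured by this sketch.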