SuperGPQA
A large-scale knowledge-and-reasoning benchmark of ~26,000 graduate-level multiple-choice questions (up to 10 answer options each) spanning 285 academic disciplines, measuring overall answer accuracy.
Reasoning · Accuracy · Higher is better
| Model | Lab | Score | Source model | Status | Date |
|---|---|---|---|---|---|
| Qwen 3.6 Max Preview | Alibaba / Qwen | 73.9% | — | Unverified | Apr 20, 2026 |
| Qwen3.7 Max Preview | Alibaba / Qwen | 73.6% | — | Unverified | May 14, 2026 |
| Qwen 3.6 Plus | Alibaba / Qwen | 71.6% | — | Unverified | Apr 2, 2026 |
| Qwen3.5-397B-A17B | Alibaba / Qwen | 70.4% | — | Unverified | Feb 16, 2026 |
| Qwen3 Max | Alibaba / Qwen | 65.1% | — | Unverified | Sep 5, 2025 |
| Qwen3 235B A22B Instruct 2507 | Alibaba / Qwen | 62.6% | — | Unverified | Jul 21, 2025 |
| DeepSeek R1 | DeepSeek | 61.82% | — | Verified | Jan 20, 2025 |
| Kimi K2 Instruct | Moonshot AI | 57.2% | — | Unverified | Jul 11, 2025 |
| Claude 3.5 Sonnet | Anthropic | 48.16% | — | Verified | Jun 20, 2024 |
| Gemini 2.0 Flash | Google DeepMind | 47.73% | — | Verified | Dec 11, 2024 |
| DeepSeek V3 | DeepSeek | 47.40% | — | Verified | Dec 26, 2024 |
| GPT-4o | OpenAI | 44.40% | — | Verified | May 13, 2024 |
| Llama 3.1 405B | Meta | 43.14% | — | Verified | Jul 23, 2024 |
| Mistral Large | Mistral AI | 40.65% | — | Verified | Feb 26, 2024 |
Each row reports the model’s accuracy on SuperGPQA. Rows are sorted by score, highest first.
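To show what an overall accuracy score of this kind means, here is a minimal sketch of how it could be computed from a model's chosen options and an answer key. The function name `accuracy`, the option letters A through J, and the toy data are assumptions for illustration only. This is not the benchmark's official scoring code.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Return the fraction of predictions that match the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot score an empty question set")
    # Count exact matches between the chosen option and the correct option.
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


# Hypothetical toy data: option letters A-J (up to 10 options per question).
answers = ["A", "C", "J", "B"]
predictions = ["A", "C", "D", "B"]
print(f"{accuracy(predictions, answers):.2%}")  # 75.00%
```

For scale, uniform random guessing on a question with the full 10 options would be correct 10% of the time, so the scores in the table sit well above chance.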