evals.report

SuperGPQA

A large-scale knowledge-and-reasoning benchmark of ~26,000 graduate-level multiple-choice questions (up to 10 answer options each) spanning 285 academic disciplines, measuring overall answer accuracy.
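
As a rough illustration of the metric, the sketch below computes overall accuracy as the fraction of questions where the predicted option matches the gold option. It assumes each answer is a single option letter (A through J for up to 10 options) and uses made-up toy data; the function name `accuracy` and the input format are hypothetical and do not reflect the benchmark's actual evaluation harness.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Return the fraction of questions whose predicted letter matches the gold letter."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("need at least one question")
    # Compare case-insensitively and ignore surrounding whitespace.
    correct = sum(
        p.strip().upper() == g.strip().upper()
        for p, g in zip(predictions, gold)
    )
    return correct / len(gold)


# Hypothetical toy data: four questions, option letters drawn from A..J.
gold = ["A", "J", "C", "F"]
predictions = ["A", "B", "c", "F"]

score = accuracy(predictions, gold)
print(f"{score:.2%}")  # 75.00%
```

With up to 10 options per question, uniform random guessing would score about 10% on questions that have all 10 options, which gives a floor for reading the scores below.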

Category: Reasoning. Metric: accuracy (higher is better).
| Model | Lab | Score | Status | Date |
|---|---|---|---|---|
| Qwen 3.6 Max Preview | Alibaba / Qwen | 73.9% | Unverified | Apr 20, 2026 |
| Qwen3.7 Max Preview | Alibaba / Qwen | 73.6% | Unverified | May 14, 2026 |
| Qwen 3.6 Plus | Alibaba / Qwen | 71.6% | Unverified | Apr 2, 2026 |
| Qwen3.5-397B-A17B | Alibaba / Qwen | 70.4% | Unverified | Feb 16, 2026 |
| Qwen3 Max | Alibaba / Qwen | 65.1% | Unverified | Sep 5, 2025 |
| Qwen3 235B A22B Instruct 2507 | Alibaba / Qwen | 62.6% | Unverified | Jul 21, 2025 |
| DeepSeek R1 | DeepSeek | 61.82% | Verified | Jan 20, 2025 |
| Kimi K2 Instruct | Moonshot AI | 57.2% | Unverified | Jul 11, 2025 |
| Claude 3.5 Sonnet | Anthropic | 48.16% | Verified | Jun 20, 2024 |
| Gemini 2.0 Flash | Google DeepMind | 47.73% | Verified | Dec 11, 2024 |
| DeepSeek V3 | DeepSeek | 47.40% | Verified | Dec 26, 2024 |
| GPT-4o | OpenAI | 44.40% | Verified | May 13, 2024 |
| Llama 3.1 405B | Meta | 43.14% | Verified | Jul 23, 2024 |
| Mistral Large | Mistral AI | 40.65% | Verified | Feb 26, 2024 |

Each row reports the model’s accuracy on SuperGPQA.