evals.report

GPQA Diamond

A difficult subset of GPQA, used to evaluate graduate-level science question answering.

Category: Reasoning. Metric: accuracy. Higher is better.
| Model | Lab | Score | Source model | Status | Date |
| --- | --- | --- | --- | --- | --- |
| GPT-5.4 Pro | OpenAI | 94.6% | GPT-5.4 Pro | Official | May 30, 2026 |
| Gemini 3.1 Pro Preview | Google DeepMind | 94.1% | Gemini 3.1 Pro | Official | May 30, 2026 |
| GPT-5.5 | OpenAI | 94.0% | GPT-5.5 | Official | May 30, 2026 |
| GPT-5.5 Pro | OpenAI | 93.9% | GPT-5.5 Pro | Official | May 30, 2026 |
| Claude Opus 4.8 | Anthropic | 93.6% | Claude Opus 4.8 | Verified | May 28, 2026 |
| GPT-5.4 xHigh | OpenAI | 93.3% | GPT-5.4 | Official | May 30, 2026 |
| Gemini 3.5 Flash | Google DeepMind | 92.8% | Gemini 3.5 Flash | Official | May 30, 2026 |
| Gemini 3 Pro | Google DeepMind | 92.6% | Gemini 3 Pro | Official | May 30, 2026 |
| GPT-5.2 | OpenAI | 91.4% | GPT-5.2 | Official | May 30, 2026 |
| Kimi K2.6 | Moonshot AI | 90.8% | Kimi K2.6 | Official | May 30, 2026 |
| Claude Opus 4.6 | Anthropic | 90.5% | Claude Opus 4.6 | Official | May 30, 2026 |
| Claude Opus 4.7 | Anthropic | 90.2% | Claude Opus 4.7 | Official | May 30, 2026 |
| Muse Spark | Meta | 89.8% | Muse Spark | Official | May 30, 2026 |
| Qwen 3.6 Max Preview | Alibaba / Qwen | 89.1% | Qwen 3.6 Max (Preview) | Official | May 30, 2026 |
| GLM-5 | Z.ai | 87.8% | GLM-5 | Official | May 30, 2026 |
| GPT-5.1 | OpenAI | 87.6% | GPT-5.1 | Official | May 30, 2026 |
| Kimi K2.5 | Moonshot AI | 87.6% | Kimi K2.5 | Official | May 30, 2026 |
| Qwen 3.6 Plus | Alibaba / Qwen | 87.4% | Qwen 3.6 Plus | Official | May 30, 2026 |
| Claude Sonnet 4.6 | Anthropic | 87.4% | Claude Sonnet 4.6 | Official | May 30, 2026 |
| Grok 4 | xAI | 87.0% | Grok 4 | Official | May 30, 2026 |
| GPT-5 high | OpenAI | 86.2% | GPT-5 | Official | May 30, 2026 |
| Claude Opus 4.5 | Anthropic | 86.0% | Claude Opus 4.5 | Official | May 30, 2026 |
| GLM-5.1 | Z.ai | 85.5% | GLM-5.1 | Official | May 30, 2026 |
| Gemini 2.5 Pro | Google DeepMind | 85.3% | Gemini 2.5 Pro (Jun 2025) | Official | May 30, 2026 |

Each row reports the model’s accuracy on GPQA Diamond.