evals.report

SimpleQA Verified

A factual short-answer QA benchmark measuring parametric knowledge and hallucination resistance (Epoch AI's SimpleQA Verified).

Category: Other. Metric: accuracy (higher is better).
| Model | Lab | Score | Source model | Status | Date |
| --- | --- | --- | --- | --- | --- |
| Gemini 3.1 Pro Preview | Google DeepMind | 77.3% | Gemini 3.1 Pro | Official | May 30, 2026 |
| Gemini 3 Pro | Google DeepMind | 72.9% | Gemini 3 Pro | Official | May 30, 2026 |
| Gemini 3.5 Flash | Google DeepMind | 68.4% | Gemini 3.5 Flash | Official | May 30, 2026 |
| Qwen3 Max | Alibaba / Qwen | 67.5% | Qwen3-Max | Official | May 30, 2026 |
| Gemini 3 Flash | Google DeepMind | 67.4% | Gemini 3 Flash | Official | May 30, 2026 |
| Muse Spark | Meta | 66.3% | Muse Spark | Official | May 30, 2026 |
| GPT-5.5 Pro | OpenAI | 64.5% | GPT-5.5 Pro | Official | May 30, 2026 |
| GPT-5.5 | OpenAI | 63.1% | GPT-5.5 | Official | May 30, 2026 |
| Qwen 3.6 Max Preview | Alibaba / Qwen | 56.9% | Qwen 3.6 Max (Preview) | Official | May 30, 2026 |
| Gemini 2.5 Pro | Google DeepMind | 56.0% | Gemini 2.5 Pro (Jun 2025) | Official | May 30, 2026 |
| o3 | OpenAI | 53.0% | o3 | Official | May 30, 2026 |
| Claude Opus 4.7 | Anthropic | 50.6% | Claude Opus 4.7 | Official | May 30, 2026 |
| GPT-5 high | OpenAI | 50.6% | GPT-5 | Official | May 30, 2026 |
| Qwen3 235B A22B Instruct 2507 | Alibaba / Qwen | 50.1% | Qwen3-235B-A22B (Jul 2025) | Official | May 30, 2026 |
| Qwen 3.6 Plus | Alibaba / Qwen | 49.1% | Qwen 3.6 Plus | Official | May 30, 2026 |
| GPT-5.1 | OpenAI | 48.9% | GPT-5.1 | Official | May 30, 2026 |
| Grok 4 | xAI | 47.9% | Grok 4 | Official | May 30, 2026 |
| GPT-5.4 Pro | OpenAI | 47.8% | GPT-5.4 Pro | Official | May 30, 2026 |
| Claude Opus 4.6 | Anthropic | 46.5% | Claude Opus 4.6 | Official | May 30, 2026 |
| GPT-5.4 xHigh | OpenAI | 44.8% | GPT-5.4 | Official | May 30, 2026 |

Each row reports the model’s accuracy on SimpleQA Verified.