evals.report

FACTS Grounding

A Google DeepMind benchmark that measures how well an LLM's long-form responses are grounded in a provided source document. The score is the share of responses that are both eligible and fully supported by the context, with no hallucinations.
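As a rough illustration of the score described above, the sketch below computes the share of responses that are both eligible and fully supported. It assumes each response has already been judged; the `JudgedResponse` record, its field names, and the `grounding_accuracy` helper are hypothetical and are not taken from the benchmark's own tooling, which may aggregate judgments differently.

```python
from dataclasses import dataclass


@dataclass
class JudgedResponse:
    """One judged model response (hypothetical record layout)."""
    eligible: bool   # assumed flag: the response adequately addresses the request
    supported: bool  # assumed flag: every claim is backed by the source document


def grounding_accuracy(responses: list[JudgedResponse]) -> float:
    """Share of responses that are both eligible and fully supported."""
    if not responses:
        raise ValueError("need at least one judged response")
    passed = sum(1 for r in responses if r.eligible and r.supported)
    return passed / len(responses)


judged = [
    JudgedResponse(eligible=True, supported=True),
    JudgedResponse(eligible=True, supported=False),   # contains an unsupported claim
    JudgedResponse(eligible=False, supported=True),   # grounded, but does not answer the request
    JudgedResponse(eligible=True, supported=True),
]
print(f"{grounding_accuracy(judged):.1%}")  # 2 of 4 responses pass -> 50.0%
```

Under this reading, a response that is grounded but ineligible counts the same as a hallucinated one: it does not contribute to the numerator.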

Category: Reasoning | Metric: Grounding accuracy | Higher is better
| Model             | Lab             | Score | Status   | Date         |
|-------------------|-----------------|-------|----------|--------------|
| Gemini 2.0 Flash  | Google DeepMind | 83.6% | Verified | Dec 11, 2024 |
| Gemini 1.5 Pro    | Google DeepMind | 80.0% | Verified | Feb 15, 2024 |
| Claude 3.5 Sonnet | Anthropic       | 79.4% | Verified | Jun 20, 2024 |
| GPT-4o            | OpenAI          | 78.8% | Verified | May 13, 2024 |
| Gemini 2.5 Pro    | Google DeepMind | 74.2% | Verified | Mar 25, 2025 |
| Gemini 2.5 Flash  | Google DeepMind | 69.9% | Verified | Apr 17, 2025 |
| GPT-5             | OpenAI          | 69.6% | Verified | Aug 7, 2025  |
| Gemini 3 Pro      | Google DeepMind | 69.0% | Verified | Nov 18, 2025 |
| Claude Opus 4.5   | Anthropic       | 62.1% | Verified | Nov 24, 2025 |
| Claude Sonnet 4.5 | Anthropic       | 61.8% | Verified | Sep 29, 2025 |
| GPT-5 mini        | OpenAI          | 58.3% | Verified | Aug 7, 2025  |
| Claude Sonnet 4   | Anthropic       | 56.1% | Verified | May 22, 2025 |
| Claude Opus 4.1   | Anthropic       | 54.8% | Verified | Aug 5, 2025  |
| Grok 4            | xAI             | 54.7% | Verified | Jul 9, 2025  |
| GPT-5.1           | OpenAI          | 50.0% | Verified | Nov 12, 2025 |
| GPT-4.1           | OpenAI          | 45.6% | Verified | Apr 14, 2025 |
| o3                | OpenAI          | 36.2% | Verified | Apr 16, 2025 |
| o4-mini           | OpenAI          | 29.3% | Verified | Apr 16, 2025 |

Each row reports the model’s Grounding accuracy on FACTS Grounding.