FACTS Grounding
A Google DeepMind benchmark that measures how well an LLM's long-form responses are grounded in a provided source document. The score is the share of responses that are both eligible and fully supported by the context, with no hallucinations.
Reasoning · Grounding accuracy · Higher is better
| Model | Lab | Score | Source model | Status | Date |
|---|---|---|---|---|---|
| Gemini 2.0 Flash | Google DeepMind | 83.6% | — | Verified | Dec 11, 2024 |
| Gemini 1.5 Pro | Google DeepMind | 80.0% | — | Verified | Feb 15, 2024 |
| Claude 3.5 Sonnet | Anthropic | 79.4% | — | Verified | Jun 20, 2024 |
| GPT-4o | OpenAI | 78.8% | — | Verified | May 13, 2024 |
| Gemini 2.5 Pro | Google DeepMind | 74.2% | — | Verified | Mar 25, 2025 |
| Gemini 2.5 Flash | Google DeepMind | 69.9% | — | Verified | Apr 17, 2025 |
| GPT-5 | OpenAI | 69.6% | — | Verified | Aug 7, 2025 |
| Gemini 3 Pro | Google DeepMind | 69.0% | — | Verified | Nov 18, 2025 |
| Claude Opus 4.5 | Anthropic | 62.1% | — | Verified | Nov 24, 2025 |
| Claude Sonnet 4.5 | Anthropic | 61.8% | — | Verified | Sep 29, 2025 |
| GPT-5 mini | OpenAI | 58.3% | — | Verified | Aug 7, 2025 |
| Claude Sonnet 4 | Anthropic | 56.1% | — | Verified | May 22, 2025 |
| Claude Opus 4.1 | Anthropic | 54.8% | — | Verified | Aug 5, 2025 |
| Grok 4 | xAI | 54.7% | — | Verified | Jul 9, 2025 |
| GPT-5.1 | OpenAI | 50.0% | — | Verified | Nov 12, 2025 |
| GPT-4.1 | OpenAI | 45.6% | — | Verified | Apr 14, 2025 |
| o3 | OpenAI | 36.2% | — | Verified | Apr 16, 2025 |
| o4-mini | OpenAI | 29.3% | — | Verified | Apr 16, 2025 |
Each row reports the model’s Grounding accuracy on FACTS Grounding.
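Grounding accuracy, as described above, is the fraction of responses that are both eligible and fully supported by the source document. The sketch below shows that arithmetic only. The `Judgment` type, its two boolean fields, and the sample data are hypothetical names introduced for illustration; the page does not describe how the benchmark produces or aggregates its per-response verdicts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    """Hypothetical per-response verdict (not the benchmark's own schema)."""
    eligible: bool  # the response adequately addresses the request
    grounded: bool  # every claim is supported by the source document


def grounding_accuracy(judgments: list[Judgment]) -> float:
    """Share of responses that are both eligible and fully grounded."""
    if not judgments:
        raise ValueError("need at least one judged response")
    passing = sum(1 for j in judgments if j.eligible and j.grounded)
    return passing / len(judgments)


# Illustrative data only: four responses, two of which pass both checks.
sample = [
    Judgment(eligible=True, grounded=True),
    Judgment(eligible=True, grounded=False),   # contains an unsupported claim
    Judgment(eligible=False, grounded=True),   # grounded but does not answer
    Judgment(eligible=True, grounded=True),
]

print(f"{grounding_accuracy(sample):.1%}")  # prints 50.0%
```

Under this reading, a response that is grounded but ineligible counts against the score just as a hallucinated one does, which is why both conditions are combined with `and`.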