MASK (Model Alignment between Statements and Knowledge)
A human-collected honesty benchmark that first elicits a model's beliefs, then measures whether the model maintains truthful assertions when directly or indirectly pressured to lie, disentangling honesty from factual accuracy.
Other · Honesty score · Higher is better
| Model | Lab | Honesty score | Source model | Status | Date |
|---|---|---|---|---|---|
| Claude Opus 4.6 | Anthropic | 96.28 | — | Verified | Feb 5, 2026 |
| Claude Sonnet 4.5 | Anthropic | 96.13 | — | Verified | Sep 29, 2025 |
| Claude Sonnet 4 | Anthropic | 95.33 | — | Verified | May 22, 2025 |
| Claude Opus 4.1 | Anthropic | 94.20 | — | Verified | Aug 5, 2025 |
| Claude Opus 4.5 | Anthropic | 92.53 | — | Verified | Nov 24, 2025 |
| GPT-OSS-120B | OpenAI | 92.00 | — | Verified | Aug 5, 2025 |
| GPT-5.4 Pro | OpenAI | 91.73 | — | Verified | Mar 5, 2026 |
| GPT-5.4 | OpenAI | 89.67 | — | Verified | Mar 5, 2026 |
| Claude Opus 4 | Anthropic | 87.87 | — | Verified | May 22, 2025 |
| GPT-5.2 | OpenAI | 86.67 | — | Verified | Dec 11, 2025 |
| GPT-5.1 | OpenAI | 86.33 | — | Verified | Nov 12, 2025 |
| o3 | OpenAI | 84.47 | — | Verified | Apr 16, 2025 |
| GPT-5 mini | OpenAI | 82.60 | — | Verified | Aug 7, 2025 |
| OpenAI o3-pro | OpenAI | 82.50 | — | Verified | Jun 10, 2025 |
| Claude 3.7 Sonnet | Anthropic | 82.13 | — | Verified | Feb 24, 2025 |
| GPT-5 | OpenAI | 79.33 | — | Verified | Aug 7, 2025 |
| o4-mini | OpenAI | 78.60 | — | Verified | Apr 16, 2025 |
| Claude 3.5 Sonnet | Anthropic | 72.33 | — | Verified | Jun 20, 2024 |
| Kimi K2.5 | Moonshot AI | 70.47 | — | Verified | Jan 27, 2026 |
| Llama 3.1 405B | Meta | 61.40 | — | Verified | Jul 23, 2024 |
| GPT-4o | OpenAI | 60.07 | — | Verified | May 13, 2024 |
| DeepSeek R1 | DeepSeek | 57.32 | — | Verified | Jan 20, 2025 |
| Gemini 2.5 Pro | Google DeepMind | 55.67 | — | Verified | Mar 25, 2025 |
| GPT-4.1 | OpenAI | 51.13 | — | Verified | Apr 14, 2025 |
| Llama 4 Maverick | Meta | 49.73 | — | Verified | Apr 5, 2025 |
| Gemini 2.5 Flash | Google DeepMind | 49.13 | — | Verified | Apr 17, 2025 |
| Gemini 2.0 Flash | Google DeepMind | 49.07 | — | Verified | Dec 11, 2024 |
| Mistral Large | Mistral AI | 47.53 | — | Verified | Feb 26, 2024 |
| Kimi K2 Instruct | Moonshot AI | 46.67 | — | Verified | Jul 11, 2025 |
| DeepSeek V3.1 | DeepSeek | 46.27 | — | Verified | Aug 21, 2025 |
| DeepSeek V3 0324 | DeepSeek | 44.53 | — | Verified | Mar 24, 2025 |
| Gemini 3 Pro | Google DeepMind | 42.60 | — | Verified | Nov 18, 2025 |
| Gemini 3.1 Pro Preview | Google DeepMind | 42.40 | — | Verified | Feb 19, 2026 |
Each row reports the model’s Honesty score on MASK (Model Alignment between Statements and Knowledge).
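To make the distinction between honesty and factual accuracy concrete, here is a minimal Python sketch of how a score of this kind could be computed. It is an illustration under stated assumptions, not the benchmark's actual scoring code. The `Record` type, its field names, the `classify` labels, and the formula "100 × (1 − fraction of lies)" are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Record:
    # Hypothetical fields, chosen for illustration only.
    belief: Optional[str]     # answer given when asked neutrally; None if no stable belief
    statement: Optional[str]  # answer asserted under pressure; None if the model evaded
    truth: str                # ground-truth answer, used only for accuracy


def classify(record: Record) -> str:
    """Label one example by comparing the pressured statement to the elicited belief."""
    if record.belief is None:
        return "no_belief"
    if record.statement is None:
        return "evade"
    return "honest" if record.statement == record.belief else "lie"


def honesty_score(records: List[Record]) -> float:
    """Assumed metric: 100 * (1 - fraction of examples classified as lies)."""
    if not records:
        raise ValueError("need at least one record")
    lies = sum(classify(r) == "lie" for r in records)
    return 100.0 * (1 - lies / len(records))


def accuracy(records: List[Record]) -> float:
    """Share of examples where the elicited belief matches the ground truth."""
    if not records:
        raise ValueError("need at least one record")
    correct = sum(r.belief == r.truth for r in records)
    return 100.0 * correct / len(records)


if __name__ == "__main__":
    sample = [
        Record(belief="A", statement="A", truth="A"),   # honest and accurate
        Record(belief="A", statement="B", truth="A"),   # lie: contradicts own belief
        Record(belief="B", statement="B", truth="A"),   # honest but inaccurate
        Record(belief="A", statement=None, truth="A"),  # evasion, not counted as a lie
    ]
    print(honesty_score(sample), accuracy(sample))
```

In this sketch `honesty_score` never looks at `truth`, so a model that sincerely holds a wrong belief and states it is still counted as honest. That separation is the property the benchmark description refers to. Whether evasions and no-belief cases count for or against a model is a design choice this example makes arbitrarily.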