evals.report
BenchmarksLabsCompareRun guides

MASK (Model Alignment between Statements and Knowledge)

A human-collected honesty benchmark that first elicits a model's beliefs, then measures whether the model maintains truthful assertions when directly or indirectly pressured to lie, disentangling honesty from factual accuracy.

OtherHonesty scoreHigher is better
ModelLabScoreSource modelStatusDate
Claude Opus 4.6Anthropic96.28VerifiedFeb 5, 2026Details
Claude Sonnet 4.5Anthropic96.13VerifiedSep 29, 2025Details
Claude Sonnet 4Anthropic95.33VerifiedMay 22, 2025Details
Claude Opus 4.1Anthropic94.20VerifiedAug 5, 2025Details
Claude Opus 4.5Anthropic92.53VerifiedNov 24, 2025Details
GPT-OSS-120BOpenAI92.00VerifiedAug 5, 2025Details
GPT-5.4 ProOpenAI91.73VerifiedMar 5, 2026Details
GPT-5.4OpenAI89.67VerifiedMar 5, 2026Details
Claude Opus 4Anthropic87.87VerifiedMay 22, 2025Details
GPT-5.2OpenAI86.67VerifiedDec 11, 2025Details
GPT-5.1OpenAI86.33VerifiedNov 12, 2025Details
o3OpenAI84.47VerifiedApr 16, 2025Details
GPT-5 miniOpenAI82.60VerifiedAug 7, 2025Details
OpenAI o3-proOpenAI82.50VerifiedJun 10, 2025Details
Claude 3.7 SonnetAnthropic82.13VerifiedFeb 24, 2025Details
GPT-5OpenAI79.33VerifiedAug 7, 2025Details
o4-miniOpenAI78.60VerifiedApr 16, 2025Details
Claude 3.5 SonnetAnthropic72.33VerifiedJun 20, 2024Details
Kimi K2.5Moonshot AI70.47VerifiedJan 27, 2026Details
Llama 3.1 405BMeta61.40VerifiedJul 23, 2024Details
GPT-4oOpenAI60.07VerifiedMay 13, 2024Details
DeepSeek R1DeepSeek57.32VerifiedJan 20, 2025Details
Gemini 2.5 ProGoogle DeepMind55.67VerifiedMar 25, 2025Details
GPT-4.1OpenAI51.13VerifiedApr 14, 2025Details
Llama 4 MaverickMeta49.73VerifiedApr 5, 2025Details
Gemini 2.5 FlashGoogle DeepMind49.13VerifiedApr 17, 2025Details
Gemini 2.0 FlashGoogle DeepMind49.07VerifiedDec 11, 2024Details
Mistral LargeMistral AI47.53VerifiedFeb 26, 2024Details
Kimi K2 InstructMoonshot AI46.67VerifiedJul 11, 2025Details
DeepSeek V3.1DeepSeek46.27VerifiedAug 21, 2025Details
DeepSeek V3 0324DeepSeek44.53VerifiedMar 24, 2025Details
Gemini 3 ProGoogle DeepMind42.60VerifiedNov 18, 2025Details
Gemini 3.1 Pro PreviewGoogle DeepMind42.40VerifiedFeb 19, 2026Details

Each row reports the model’s Honesty score on MASK (Model Alignment between Statements and Knowledge). Click a row for the full run context.