evals.report

EnigmaEval

A benchmark of 1,184 puzzle-hunt challenges spanning text and images that probes models' ability to perform implicit knowledge synthesis, lateral thinking, and multi-step deductive reasoning to uncover hidden solution paths.

Category: Reasoning. Metric: accuracy. Higher is better.
| Model | Lab | Score | Source model | Status | Date |
| --- | --- | --- | --- | --- | --- |
| GPT-5.4 Pro | OpenAI | 23.82% | | Verified | Mar 5, 2026 |
| Gemini 3.1 Pro Preview | Google DeepMind | 19.76% | | Verified | Feb 19, 2026 |
| Gemini 3 Pro | Google DeepMind | 18.24% | | Verified | Nov 18, 2025 |
| GPT-5.4 | OpenAI | 15.96% | | Verified | Mar 5, 2026 |
| o3 | OpenAI | 13.09% | | Verified | Apr 16, 2025 |
| Claude Opus 4.5 | Anthropic | 11.91% | | Verified | Nov 24, 2025 |
| GPT-5.1 | OpenAI | 11.23% | | Verified | Nov 12, 2025 |
| GPT-5 | OpenAI | 10.47% | | Verified | Aug 7, 2025 |
| GPT-5.2 | OpenAI | 10.39% | | Verified | Dec 11, 2025 |
| o4-mini | OpenAI | 9.21% | | Verified | Apr 16, 2025 |
| GPT-5 mini | OpenAI | 8.19% | | Verified | Aug 7, 2025 |
| Claude Opus 4.6 | Anthropic | 7.60% | | Verified | Feb 5, 2026 |
| Claude Opus 4.1 | Anthropic | 7.18% | | Verified | Aug 5, 2025 |
| Claude Sonnet 4.5 | Anthropic | 6.00% | | Verified | Sep 29, 2025 |
| Claude Opus 4 | Anthropic | 5.57% | | Verified | May 22, 2025 |
| Gemini 2.5 Pro | Google DeepMind | 5.57% | | Verified | Mar 25, 2025 |
| Claude 3.7 Sonnet | Anthropic | 4.23% | | Verified | Feb 24, 2025 |
| Kimi K2.5 | Moonshot AI | 3.38% | | Verified | Jan 27, 2026 |
| Claude Sonnet 4 | Anthropic | 3.12% | | Verified | May 22, 2025 |
| Gemini 2.5 Flash | Google DeepMind | 2.70% | | Verified | Apr 17, 2025 |
| GPT-4.1 | OpenAI | 2.17% | | Verified | Apr 14, 2025 |
| Claude 3.5 Sonnet | Anthropic | 0.91% | | Verified | Jun 20, 2024 |
| GPT-4o | OpenAI | 0.80% | | Verified | May 13, 2024 |
| Gemini 2.0 Flash | Google DeepMind | 0.63% | | Verified | Dec 11, 2024 |
| Llama 4 Maverick | Meta | 0.58% | | Verified | Apr 5, 2025 |

Each row reports the model’s accuracy on EnigmaEval.
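To give the percentages a concrete scale, the sketch below converts an accuracy score into an implied number of solved puzzles. It assumes the score is a plain fraction of all 1,184 puzzles, which this page does not confirm: scores could instead be averaged over splits or repeated runs, and several of the lower scores do not map to a whole number of puzzles. The function name `approx_solved` is hypothetical and used only for illustration.

```python
# Benchmark size as stated in the EnigmaEval description above.
TOTAL_PUZZLES = 1184


def approx_solved(accuracy_percent: float, total: int = TOTAL_PUZZLES) -> float:
    """Return the implied (unrounded) count of solved puzzles.

    Assumption: accuracy = solved / total over the full puzzle set.
    """
    return accuracy_percent / 100.0 * total


# A few scores copied from the leaderboard rows.
scores = {"GPT-5.4 Pro": 23.82, "o3": 13.09, "GPT-4o": 0.80}

for model, pct in scores.items():
    print(f"{model}: ~{approx_solved(pct):.1f} of {TOTAL_PUZZLES} puzzles")
```

Under that assumption, the top score of 23.82% corresponds to roughly 282 puzzles, so more than 900 of the 1,184 challenges remain unsolved even by the leading model.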