EnigmaEval
A benchmark of 1,184 puzzle-hunt challenges spanning text and images that probes models' ability to perform implicit knowledge synthesis, lateral thinking, and multi-step deductive reasoning to uncover hidden solution paths.
Category: Reasoning. Metric: accuracy (higher is better).