BIG-Bench Extra Hard
A general-reasoning benchmark from Google DeepMind that replaces each of the 23 BIG-Bench Hard (BBH) tasks with a novel, substantially harder task probing the same skill. It measures broad reasoning (many-hop, causal, spatial, temporal, geometric, linguistic, logic-puzzle, and humor) rather than only math and coding.
Category: Reasoning
Metric: harmonic mean accuracy
Higher is better
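For readers unfamiliar with the metric, here is a minimal sketch of how a harmonic mean over per-task accuracies could be computed. The function name, the task names, and the choice to return 0.0 when any task scores zero are assumptions made for illustration; the benchmark's official scoring script may handle zero scores differently (for example by adding a small offset before averaging).

```python
def harmonic_mean_accuracy(per_task_accuracy):
    """Harmonic mean of per-task accuracies (hypothetical helper).

    per_task_accuracy: dict mapping task name -> accuracy (e.g. in percent).
    The harmonic mean is dominated by the weakest tasks, so a model
    cannot reach a high score by excelling on only a few tasks.
    """
    values = list(per_task_accuracy.values())
    if not values:
        raise ValueError("need at least one task accuracy")
    # Assumption: treat any zero (or negative) accuracy as collapsing
    # the aggregate to 0.0, since 1/0 is undefined.
    if any(v <= 0 for v in values):
        return 0.0
    return len(values) / sum(1.0 / v for v in values)


# Illustrative task names and scores, not real results.
scores = {"task_a": 20.0, "task_b": 80.0}
print(harmonic_mean_accuracy(scores))
```

With these illustrative scores the arithmetic mean would be 50, while the harmonic mean is noticeably lower because the weaker task pulls it down.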
No run guide for this benchmark yet.