evals.report

BIG-Bench Extra Hard

A general-reasoning benchmark from Google DeepMind that replaces each of the 23 BIG-Bench Hard (BBH) tasks with a new, substantially harder task probing the same skill. It measures broad reasoning (many-hop, causal, spatial, temporal, geometric, linguistic, logic-puzzle, and humor) rather than only math and coding.

Reasoning | harmonic mean accuracy | Higher is better
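As a rough illustration of the metric named above, the sketch below computes a harmonic mean over per-task accuracies. The function name `harmonic_mean_accuracy`, the task names, and the `offset` parameter are hypothetical and are not taken from the benchmark's own scoring code. A harmonic mean is pulled toward the weakest task, so one very low score drags the aggregate down far more than it would under an arithmetic mean. A plain harmonic mean is also undefined when any task scores exactly zero; adding a small constant to every accuracy is one common workaround, shown here as an optional assumption rather than as the official procedure.

```python
from typing import Mapping


def harmonic_mean_accuracy(task_accuracy: Mapping[str, float], offset: float = 0.0) -> float:
    """Harmonic mean of per-task accuracies (illustrative sketch only).

    task_accuracy: maps a task name to its accuracy (here, in percent).
    offset: hypothetical constant added to every accuracy before averaging,
            so that a task with zero accuracy does not make the mean undefined.
    """
    if not task_accuracy:
        raise ValueError("at least one task accuracy is required")

    values = [acc + offset for acc in task_accuracy.values()]
    if any(v < 0 for v in values):
        raise ValueError("accuracies must be non-negative")

    # Without an offset, a single zero forces the harmonic mean to zero.
    if any(v == 0 for v in values):
        return 0.0

    return len(values) / sum(1.0 / v for v in values)


# Hypothetical per-task accuracies, in percent.
scores = {"task_a": 50.0, "task_b": 100.0}
print(harmonic_mean_accuracy(scores))  # lower than the arithmetic mean of 75.0
```

Because the aggregate rewards evenness across tasks, a model has to do reasonably well on every task to score well overall; strength on a few tasks cannot offset failure on others.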

No run guide for this benchmark yet.