BIG-Bench Extra Hard

Name: BIG-Bench Extra Hard
Creator: evals.report
License: https://creativecommons.org/licenses/by/4.0/

A general-reasoning benchmark from Google DeepMind that replaces each of the 23 BIG-Bench Hard (BBH) tasks with a novel, substantially harder task probing the same skill. It measures broad reasoning (many-hop, causal, spatial, temporal, geometric, linguistic, logic-puzzle, and humor) rather than only math and coding.

Reasoning | harmonic mean accuracy | Higher is better
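
As an illustration of the metric named above, the following is a minimal sketch of a harmonic mean taken over per-task accuracies. The function name `harmonic_mean_accuracy`, the task names, and the example values are hypothetical, and the official scoring script may differ in details such as how zero accuracies are handled.

```python
def harmonic_mean_accuracy(task_accuracies):
    """Harmonic mean of per-task accuracies, each in the range [0, 1].

    Illustrative sketch only; not the benchmark's official scorer.
    """
    values = list(task_accuracies)
    if not values:
        raise ValueError("at least one task accuracy is required")
    # A single zero makes the reciprocal undefined; the limiting value
    # of the harmonic mean in that case is zero, so return it directly.
    if any(v <= 0 for v in values):
        return 0.0
    return len(values) / sum(1.0 / v for v in values)


# Hypothetical per-task accuracies, for illustration only.
example = {"task_a": 0.9, "task_b": 0.5, "task_c": 0.3}
harmonic = harmonic_mean_accuracy(example.values())
arithmetic = sum(example.values()) / len(example)
print(f"harmonic={harmonic:.3f} arithmetic={arithmetic:.3f}")
```

Unlike an arithmetic mean, the harmonic mean is pulled toward the weakest tasks, so a model cannot reach a high aggregate score by excelling on a few tasks while failing others.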

