evals.report

BIG-Bench Extra Hard

A general-reasoning benchmark from Google DeepMind that replaces each of the 23 BIG-Bench Hard (BBH) tasks with a novel, substantially harder task probing the same skill, measuring broad reasoning (many-hop, causal, spatial, temporal, geometric, linguistic, logic-puzzle, and humor) rather than just math and coding.

Reasoning · harmonic mean accuracy · Higher is better
| Model | Lab | Score | Source model | Status | Date |
| --- | --- | --- | --- | --- | --- |
| Gemini 2.0 Flash | Google DeepMind | 9.8% | | Official | Dec 11, 2024 |
| DeepSeek R1 | DeepSeek | 6.8% | | Official | Jan 20, 2025 |
| GPT-4o | OpenAI | 6.0% | | Official | May 13, 2024 |

Each row reports the model’s harmonic mean accuracy on BIG-Bench Extra Hard.
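The scores above are low partly because a harmonic mean is pulled toward the weakest tasks far more strongly than an arithmetic mean. The following is a minimal sketch of how such a score could be computed from per-task accuracies. The function name `harmonic_mean_accuracy`, the `offset` parameter, and the sample accuracies are all hypothetical and used only for illustration. The offset (added to each task accuracy so that a 0% task does not make the harmonic mean undefined) is an assumption here, since this page does not state how zero accuracies are handled.

```python
from statistics import mean


def harmonic_mean_accuracy(task_accuracies, offset=1.0):
    """Harmonic mean of per-task accuracies given in percent.

    `offset` is an assumed smoothing term added to every task accuracy
    so that a task scored at 0% does not cause a division by zero.
    """
    if not task_accuracies:
        raise ValueError("at least one task accuracy is required")
    shifted = [acc + offset for acc in task_accuracies]
    if any(value <= 0 for value in shifted):
        raise ValueError("all shifted accuracies must be positive")
    # Harmonic mean: n divided by the sum of reciprocals.
    return len(shifted) / sum(1.0 / value for value in shifted)


if __name__ == "__main__":
    # Hypothetical per-task accuracies in percent, not real benchmark data.
    sample = [50.0, 10.0, 2.0]
    print("arithmetic mean:", round(mean(sample), 2))
    print("harmonic mean:  ", round(harmonic_mean_accuracy(sample), 2))
```

With these made-up inputs, the single weak task dominates the harmonic mean even though the arithmetic mean looks moderate. That is why a model has to do reasonably well on every task to score well under this metric.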