BIG-Bench Extra Hard

A general-reasoning benchmark from Google DeepMind that replaces each of the 23 BIG-Bench Hard (BBH) tasks with a novel, substantially harder task probing the same skill. It measures broad reasoning (many-hop, causal, spatial, temporal, geometric, linguistic, logic-puzzle, and humor) rather than only math and coding.

Reasoning | harmonic mean accuracy | Higher is better

Model	Lab	Score	Source model	Status	Date
Gemini 2.0 Flash	Google DeepMind	9.8%	—	Official	Dec 11, 2024
DeepSeek R1	DeepSeek	6.8%	—	Official	Jan 20, 2025
GPT-4o	OpenAI	6.0%	—	Official	May 13, 2024

Each row reports the model’s harmonic mean accuracy on BIG-Bench Extra Hard.
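To illustrate why a harmonic mean yields low headline scores, here is a minimal sketch of computing it over per-task accuracies. The function name `harmonic_mean_accuracy` and the example accuracies are hypothetical, and returning 0.0 when any task scores zero is an assumption of this sketch; the benchmark's official scoring may treat zero accuracies differently.

```python
def harmonic_mean_accuracy(task_accuracies):
    """Harmonic mean of per-task accuracies given in percent.

    Assumption (not from the source): if any task accuracy is zero,
    return 0.0, which is the limit of the harmonic mean as a value
    approaches zero.
    """
    if not task_accuracies:
        raise ValueError("at least one task accuracy is required")
    if any(a < 0 for a in task_accuracies):
        raise ValueError("accuracies must be non-negative")
    if any(a == 0 for a in task_accuracies):
        return 0.0
    # n divided by the sum of reciprocals
    return len(task_accuracies) / sum(1.0 / a for a in task_accuracies)


# Made-up accuracies for two tasks, for illustration only.
example = [20.0, 80.0]
arithmetic = sum(example) / len(example)
harmonic = harmonic_mean_accuracy(example)
print(f"arithmetic mean: {arithmetic:.1f}%, harmonic mean: {harmonic:.1f}%")
```

The harmonic mean is dominated by the weakest tasks, so a model cannot offset a very low score on one task with a high score on another. This is consistent with the single-digit scores in the table.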