BIG-Bench Extra Hard
A general-reasoning benchmark from Google DeepMind that replaces each of the 23 BIG-Bench Hard (BBH) tasks with a novel, substantially harder task probing the same skill, measuring broad reasoning (many-hop, causal, spatial, temporal, geometric, linguistic, logic-puzzle, and humor) rather than just math and coding.
Category: Reasoning. Metric: harmonic mean accuracy (higher is better).
| Model | Lab | Score | Source model | Status | Date |
|---|---|---|---|---|---|
| Gemini 2.0 Flash | Google DeepMind | 9.8% | — | Official | Dec 11, 2024 |
| DeepSeek R1 | DeepSeek | 6.8% | — | Official | Jan 20, 2025 |
| GPT-4o | OpenAI | 6.0% | — | Official | May 13, 2024 |
Each row reports the model’s harmonic mean accuracy on BIG-Bench Extra Hard.
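To show why a harmonic mean gives such low headline numbers, here is a minimal sketch of how per-task accuracies could be aggregated into one score. The function name `harmonic_mean_accuracy`, the `offset` parameter, and the sample accuracies are all hypothetical. They are not taken from this page or from the benchmark's official scoring code, which may handle zero scores differently.

```python
def harmonic_mean_accuracy(task_accuracies, offset=0.0):
    """Aggregate per-task accuracies (in percent) with a harmonic mean.

    `offset` is an illustrative knob (an assumption, not from the source):
    adding a small constant to every task score keeps a single 0% task
    from forcing the aggregate to 0.
    """
    values = [a + offset for a in task_accuracies]
    if not values:
        raise ValueError("need at least one task accuracy")
    if any(v < 0 for v in values):
        raise ValueError("accuracies must be non-negative")
    if any(v == 0 for v in values):
        # The harmonic mean is defined as 0 when any term is 0.
        return 0.0
    return len(values) / sum(1.0 / v for v in values)


# Hypothetical per-task accuracies, not real benchmark results.
sample = [40.0, 10.0]
arithmetic = sum(sample) / len(sample)   # 25.0
harmonic = harmonic_mean_accuracy(sample)  # about 16.0
print(f"arithmetic mean: {arithmetic:.1f}, harmonic mean: {harmonic:.1f}")
```

Because the harmonic mean is dominated by the weakest tasks, a model cannot reach a high aggregate by excelling at a few tasks while failing others, which is consistent with the single-digit scores in the table.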