IMO-Bench
IMO-Bench is a suite of IMO-level mathematical reasoning benchmarks from Google DeepMind. Its IMO-AnswerBench component tests models on 400 robustified Olympiad problems (Algebra, Combinatorics, Geometry, Number Theory) with verifiable short answers graded by an autograder.
Reasoning · Accuracy · Higher is better
| Model | Lab | Score | Source model | Status | Date |
|---|---|---|---|---|---|
| Grok 4 | xAI | 73.1% | — | Verified | Jul 9, 2025 |
| Gemini 2.5 Pro | Google DeepMind | 68.2% | — | Verified | Mar 25, 2025 |
| o4-mini | OpenAI | 67.9% | — | Verified | Apr 16, 2025 |
| GPT-5 | OpenAI | 65.6% | — | Verified | Aug 7, 2025 |
| o3 | OpenAI | 61.1% | — | Verified | Apr 16, 2025 |
| DeepSeek R1 | DeepSeek | 60.8% | — | Verified | Jan 20, 2025 |
| Qwen3 235B A22B Instruct 2507 | Alibaba / Qwen | 53.8% | — | Verified | Jul 21, 2025 |
| Kimi K2 Instruct | Moonshot AI | 45.8% | — | Verified | Jul 11, 2025 |
| DeepSeek V3 | DeepSeek | 37.0% | — | Verified | Dec 26, 2024 |
| Claude Sonnet 4 | Anthropic | 23.0% | — | Verified | May 22, 2025 |
| Claude Opus 4 | Anthropic | 22.3% | — | Verified | May 22, 2025 |
Each row reports the model’s accuracy on IMO-Bench.
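The accuracy reported here is the share of problems whose short answer is graded as correct. As a rough illustration only, the sketch below computes such a score under the assumption that grading is an exact match after simple normalization. The benchmark's actual autograder is not described on this page and may judge answer equivalence quite differently; `normalize` and `accuracy` are hypothetical names, and the sample answers are toy data.

```python
def normalize(answer: str) -> str:
    """Hypothetical normalizer: remove all whitespace and lowercase."""
    return "".join(answer.split()).lower()


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of problems whose normalized answer equals the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot score an empty problem set")
    correct = sum(
        normalize(p) == normalize(r) for p, r in zip(predictions, references)
    )
    return correct / len(references)


# Toy data, not taken from the benchmark.
preds = ["42", " 3/4 ", "n = 7", "2"]
refs = ["42", "3/4", "n=7", "5"]
print(f"{accuracy(preds, refs):.1%}")  # prints 75.0%
```

Exact string matching is the simplest possible stand-in; it would mark mathematically equivalent answers such as `0.75` and `3/4` as different, which is one reason a real grader might need something more permissive.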