evals.report

IMO-Bench

A suite of IMO-level mathematical reasoning benchmarks from Google DeepMind, whose IMO-AnswerBench component tests models on 400 robustified Olympiad problems (Algebra, Combinatorics, Geometry, Number Theory) with verifiable short answers graded by an autograder.

Category: Reasoning. Metric: accuracy (higher is better).
| Model | Lab | Score | Source model | Status | Date |
| --- | --- | --- | --- | --- | --- |
| Grok 4 | xAI | 73.1% | | Verified | Jul 9, 2025 |
| Gemini 2.5 Pro | Google DeepMind | 68.2% | | Verified | Mar 25, 2025 |
| o4-mini | OpenAI | 67.9% | | Verified | Apr 16, 2025 |
| GPT-5 | OpenAI | 65.6% | | Verified | Aug 7, 2025 |
| o3 | OpenAI | 61.1% | | Verified | Apr 16, 2025 |
| DeepSeek R1 | DeepSeek | 60.8% | | Verified | Jan 20, 2025 |
| Qwen3 235B A22B Instruct 2507 | Alibaba / Qwen | 53.8% | | Verified | Jul 21, 2025 |
| Kimi K2 Instruct | Moonshot AI | 45.8% | | Verified | Jul 11, 2025 |
| DeepSeek V3 | DeepSeek | 37.0% | | Verified | Dec 26, 2024 |
| Claude Sonnet 4 | Anthropic | 23.0% | | Verified | May 22, 2025 |
| Claude Opus 4 | Anthropic | 22.3% | | Verified | May 22, 2025 |

Each row reports the model’s accuracy on IMO-Bench.
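As a rough illustration of what an accuracy score on a short-answer benchmark means, the sketch below grades predicted answers against reference answers and reports the percentage judged correct. It is not the benchmark's actual autograder: the matching rule (normalized string equality, with a fallback to exact rational comparison) and the names `normalize`, `is_correct`, and `accuracy` are assumptions made for this example, and the sample answers are invented.

```python
from fractions import Fraction


def normalize(answer: str) -> str:
    """Hypothetical normalization: trim, drop a trailing period and $...$, remove spaces, lowercase."""
    text = answer.strip().rstrip(".").strip("$").strip()
    return text.replace(" ", "").lower()


def is_correct(predicted: str, reference: str) -> bool:
    """Assumed matching rule: equal after normalization, or equal as rational numbers."""
    p, r = normalize(predicted), normalize(reference)
    if p == r:
        return True
    try:
        # Fraction parses strings such as "3", "0.5", and "1/2" exactly.
        return Fraction(p) == Fraction(r)
    except (ValueError, ZeroDivisionError):
        # Non-numeric answers that did not match as strings count as wrong.
        return False


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Percentage of problems whose predicted answer matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one problem is required")
    correct = sum(is_correct(p, r) for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


# Invented sample data, for illustration only.
predictions = ["42", "0.5", "n=3", "7"]
references = ["42", "1/2", "3", "8"]
score = accuracy(predictions, references)
print(f"{score:.1f}%")
```

A real autograder for Olympiad answers has to handle equivalent forms that this rule misses (for example "n=3" versus "3" above), which is why a simple string match tends to understate accuracy.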