AIME (OTIS Mock)
Competition mathematics in the AIME format (Epoch AI's OTIS Mock AIME 2024-2025 set), a high-signal short-answer math reasoning benchmark.
Category: Reasoning. Metric: accuracy (higher is better).
| Model | Lab | Score | Source model | Status | Date |
|---|---|---|---|---|---|
| GPT-5.5 Pro | OpenAI | 100.0% | GPT-5.5 Pro | Official | May 30, 2026 |
| GPT-5.5 | OpenAI | 100.0% | GPT-5.5 | Official | May 30, 2026 |
| Claude Opus 4.7 | Anthropic | 97.8% | Claude Opus 4.7 | Official | May 30, 2026 |
| Kimi K2.6 | Moonshot AI | 96.1% | Kimi K2.6 | Official | May 30, 2026 |
| GPT-5.2 | OpenAI | 96.1% | GPT-5.2 | Official | May 30, 2026 |
| Gemini 3.1 Pro Preview | Google DeepMind | 95.6% | Gemini 3.1 Pro | Official | May 30, 2026 |
| Gemini 3.5 Flash | Google DeepMind | 95.6% | Gemini 3.5 Flash | Official | May 30, 2026 |
| GPT-5.4 xHigh | OpenAI | 95.3% | GPT-5.4 | Official | May 30, 2026 |
| Claude Opus 4.6 | Anthropic | 94.4% | Claude Opus 4.6 | Official | May 30, 2026 |
| Gemini 3 Flash | Google DeepMind | 92.8% | Gemini 3 Flash | Official | May 30, 2026 |
| GLM-5.1 | Z.ai | 92.2% | GLM-5.1 | Official | May 30, 2026 |
| Kimi K2.5 | Moonshot AI | 92.2% | Kimi K2.5 | Official | May 30, 2026 |
| Gemini 3 Pro | Google DeepMind | 91.4% | Gemini 3 Pro | Official | May 30, 2026 |
| GPT-5 high | OpenAI | 91.4% | GPT-5 | Official | May 30, 2026 |
| Qwen 3.6 Max Preview | Alibaba / Qwen | 91.1% | Qwen 3.6 Max (Preview) | Official | May 30, 2026 |
| Qwen 3.6 Plus | Alibaba / Qwen | 90.6% | Qwen 3.6 Plus | Official | May 30, 2026 |
| Muse Spark | Meta | 88.9% | Muse Spark | Official | May 30, 2026 |
| GPT-OSS-120B | OpenAI | 88.9% | gpt-oss-120b | Official | May 30, 2026 |
| GPT-5.1 | OpenAI | 88.6% | GPT-5.1 | Official | May 30, 2026 |
| DeepSeek V3.2 | DeepSeek | 87.8% | DeepSeek-V3.2 | Official | May 30, 2026 |
Each row reports the model’s accuracy on AIME (OTIS Mock).
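For readers unfamiliar with the metric, here is a minimal sketch of how accuracy on a short-answer AIME-style set could be computed. It assumes exact-match grading against integer answers in the range 0 to 999 (the AIME answer format). The function names `parse_aime_answer` and `accuracy`, and the "take the last integer in the response" extraction rule, are illustrative assumptions, not the grading pipeline behind this leaderboard.

```python
import re
from typing import Optional, Sequence


def parse_aime_answer(response: str) -> Optional[int]:
    """Hypothetical extractor: return the last standalone 1-3 digit integer
    in the response text, or None if there is none."""
    matches = re.findall(r"\b\d{1,3}\b", response)
    return int(matches[-1]) if matches else None


def accuracy(responses: Sequence[str], answers: Sequence[int]) -> float:
    """Fraction of responses whose extracted answer exactly matches the key."""
    if len(responses) != len(answers):
        raise ValueError("responses and answers must have the same length")
    if not answers:
        raise ValueError("need at least one problem to score")
    correct = sum(
        parse_aime_answer(resp) == ans for resp, ans in zip(responses, answers)
    )
    return correct / len(answers)


# Toy data for illustration only (not real benchmark items).
example_responses = ["The answer is 042.", "So we get 7", "I am not sure"]
example_answers = [42, 7, 113]
example_score = accuracy(example_responses, example_answers)
print(f"{example_score:.1%}")
```

On this toy data, two of the three responses match the key; the third contains no integer and counts as incorrect.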