evals.report

AIME (OTIS Mock)

Competition mathematics in the AIME format (Epoch AI's OTIS Mock AIME 2024-2025 set), a high-signal short-answer math reasoning benchmark.

Category: Reasoning | Metric: accuracy | Higher is better
| Model | Lab | Score | Source model | Status | Date |
| --- | --- | --- | --- | --- | --- |
| GPT-5.5 Pro | OpenAI | 100.0% | GPT-5.5 Pro | Official | May 30, 2026 |
| GPT-5.5 | OpenAI | 100.0% | GPT-5.5 | Official | May 30, 2026 |
| Claude Opus 4.7 | Anthropic | 97.8% | Claude Opus 4.7 | Official | May 30, 2026 |
| Kimi K2.6 | Moonshot AI | 96.1% | Kimi K2.6 | Official | May 30, 2026 |
| GPT-5.2 | OpenAI | 96.1% | GPT-5.2 | Official | May 30, 2026 |
| Gemini 3.1 Pro Preview | Google DeepMind | 95.6% | Gemini 3.1 Pro | Official | May 30, 2026 |
| Gemini 3.5 Flash | Google DeepMind | 95.6% | Gemini 3.5 Flash | Official | May 30, 2026 |
| GPT-5.4 xHigh | OpenAI | 95.3% | GPT-5.4 | Official | May 30, 2026 |
| Claude Opus 4.6 | Anthropic | 94.4% | Claude Opus 4.6 | Official | May 30, 2026 |
| Gemini 3 Flash | Google DeepMind | 92.8% | Gemini 3 Flash | Official | May 30, 2026 |
| GLM-5.1 | Z.ai | 92.2% | GLM-5.1 | Official | May 30, 2026 |
| Kimi K2.5 | Moonshot AI | 92.2% | Kimi K2.5 | Official | May 30, 2026 |
| Gemini 3 Pro | Google DeepMind | 91.4% | Gemini 3 Pro | Official | May 30, 2026 |
| GPT-5 high | OpenAI | 91.4% | GPT-5 | Official | May 30, 2026 |
| Qwen 3.6 Max Preview | Alibaba / Qwen | 91.1% | Qwen 3.6 Max (Preview) | Official | May 30, 2026 |
| Qwen 3.6 Plus | Alibaba / Qwen | 90.6% | Qwen 3.6 Plus | Official | May 30, 2026 |
| Muse Spark | Meta | 88.9% | Muse Spark | Official | May 30, 2026 |
| GPT-OSS-120B | OpenAI | 88.9% | gpt-oss-120b | Official | May 30, 2026 |
| GPT-5.1 | OpenAI | 88.6% | GPT-5.1 | Official | May 30, 2026 |
| DeepSeek V3.2 | DeepSeek | 87.8% | DeepSeek-V3.2 | Official | May 30, 2026 |

Each row reports the model’s accuracy on AIME (OTIS Mock).
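As a rough illustration of what an accuracy score on an AIME-style short-answer benchmark means, the sketch below scores predictions by exact match against integer reference answers (AIME answers are integers from 0 to 999). The page does not describe the grading procedure actually used for these runs, so the exact-match rule and the helper names `normalize_answer` and `accuracy` are assumptions made for illustration only.

```python
from typing import List, Optional


def normalize_answer(raw: str) -> Optional[int]:
    """Parse a final answer as an AIME-style integer in 0..999, else None.

    Assumption: answers arrive as plain digit strings; anything else
    (text, signs, decimals, out-of-range values) counts as unparseable.
    """
    text = raw.strip()
    if not (text.isascii() and text.isdigit()):
        return None
    value = int(text)  # leading zeros such as "073" parse to 73
    return value if 0 <= value <= 999 else None


def accuracy(predictions: List[str], references: List[int]) -> float:
    """Fraction of predictions that exactly match their reference answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(
        normalize_answer(pred) == ref
        for pred, ref in zip(predictions, references)
    )
    return correct / len(references)


# Hypothetical example data, not taken from the benchmark.
score = accuracy(["204", " 073 ", "n/a", "1000"], [204, 73, 5, 100])
print(f"{score:.1%}")
```

Under this rule an unparseable or out-of-range answer simply counts as wrong, which is one common convention for short-answer math grading. A real harness may differ, for example by extracting the answer from free-form text first.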