evals.report

AIME 2026

Accuracy of LLMs on the 30 problems of the 2026 American Invitational Mathematics Examination (AIME I and II), a contamination-free competition-math benchmark requiring integer answers (0-999), evaluated live by MathArena.

Category: Reasoning. Metric: accuracy (higher is better).
| Model | Lab | Score | Source model | Status | Date |
| --- | --- | --- | --- | --- | --- |
| Claude Opus 4.8 | Anthropic | 100.00% | | Official | May 28, 2026 |
| GPT-5.4 | OpenAI | 99.17% | | Official | Mar 5, 2026 |
| GPT-5.2 | OpenAI | 98.33% | | Official | Dec 11, 2025 |
| Gemini 3.1 Pro Preview | Google DeepMind | 98.33% | | Official | Feb 19, 2026 |
| GPT-5.5 | OpenAI | 97.50% | | Official | Apr 23, 2026 |
| Claude Opus 4.6 | Anthropic | 96.67% | | Official | Feb 5, 2026 |
| DeepSeek V4 Flash | DeepSeek | 95.83% | | Official | Apr 24, 2026 |
| Gemini 3 Flash | Google DeepMind | 95.83% | | Official | Dec 17, 2025 |
| Kimi K2.5 | Moonshot AI | 95.83% | | Official | Jan 27, 2026 |
| GLM-5 | Z.ai | 95.83% | | Official | Feb 11, 2026 |
| DeepSeek V4 Pro | DeepSeek | 95.83% | | Official | Apr 24, 2026 |
| GLM-5.1 | Z.ai | 95.83% | | Official | Apr 7, 2026 |
| Kimi K2.6 | Moonshot AI | 95.83% | | Official | Apr 20, 2026 |
| Claude Opus 4.7 | Anthropic | 95.83% | | Official | Apr 16, 2026 |
| Gemini 3.5 Flash | Google DeepMind | 95.00% | | Official | May 19, 2026 |
| Grok 4.1 fast reasoning | xAI | 94.17% | | Official | Nov 19, 2025 |
| DeepSeek V3.2 | DeepSeek | 94.17% | | Official | Dec 1, 2025 |
| Qwen3.5-397B-A17B | Alibaba / Qwen | 93.33% | | Official | Feb 16, 2026 |
| Gemini 3 Pro | Google DeepMind | 91.67% | | Official | Nov 18, 2025 |
| NVIDIA Nemotron 3 Super 120B-A12B | NVIDIA | 90.00% | | Official | Mar 10, 2026 |

Each row reports the model’s accuracy on AIME 2026.
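
As a rough illustration of how an accuracy figure like those above could be computed, here is a minimal sketch. It assumes each of the 30 problems is attempted several times and that accuracy is the fraction of correct attempts over all attempts. The four-runs-per-problem setup used in the usage example is an inference (the listed scores are all consistent with multiples of 1/120), not something this page states, and the function names are hypothetical rather than MathArena's actual code.

```python
from typing import Optional, Sequence


def is_valid_aime_answer(answer: Optional[int]) -> bool:
    """AIME answers are integers from 0 to 999 inclusive."""
    return isinstance(answer, int) and 0 <= answer <= 999


def accuracy_percent(
    predictions: Sequence[Sequence[Optional[int]]],
    gold: Sequence[int],
) -> float:
    """Percentage of correct attempts over all attempts.

    predictions[i] holds the model's answers to problem i, one per run.
    A missing (None) or out-of-range answer is counted as incorrect.
    """
    if len(predictions) != len(gold):
        raise ValueError("need one list of attempts per problem")

    correct = 0
    total = 0
    for attempts, expected in zip(predictions, gold):
        for answer in attempts:
            total += 1
            if is_valid_aime_answer(answer) and answer == expected:
                correct += 1

    if total == 0:
        raise ValueError("no attempts to score")
    return round(100.0 * correct / total, 2)


# Hypothetical data: 30 problems, 4 runs each (assumed), one wrong attempt.
gold_answers = list(range(100, 130))
runs = [[g, g, g, g] for g in gold_answers]
runs[0][3] = None  # one failed attempt out of 120
score = accuracy_percent(runs, gold_answers)
```

Under these assumptions, a single wrong attempt out of 120 gives 119/120, which rounds to 99.17%, matching the granularity of the scores in the table.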