evals.report

MathArena HMMT February 2026

Contamination-free evaluation of large language models on the 33 problems of the HMMT February 2026 mathematics competition, scoring final-answer accuracy (pass@1 estimated from 4 samples per problem) on problems released after model training.
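The pass@1 estimate described above can be sketched as follows. This is a minimal illustration, assuming the score is the mean over problems of the fraction of correct samples; the function name `pass_at_1` and the sample data are hypothetical, and MathArena's actual grading code may differ.

```python
from typing import Sequence

NUM_PROBLEMS = 33         # from the benchmark description
SAMPLES_PER_PROBLEM = 4   # from the benchmark description


def pass_at_1(results: Sequence[Sequence[bool]]) -> float:
    """Estimate pass@1 as the mean, over problems, of the fraction of
    samples whose final answer was graded correct.

    Assumption: every problem is weighted equally.
    """
    if not results:
        raise ValueError("results must contain at least one problem")
    per_problem = [sum(samples) / len(samples) for samples in results]
    return sum(per_problem) / len(per_problem)


# Hypothetical run (not real data): 30 problems solved on all 4 samples,
# 3 problems solved on 3 of 4 samples, i.e. 129 of 132 answers correct.
results = [[True] * 4] * 30 + [[True, True, True, False]] * 3
assert len(results) == NUM_PROBLEMS

score = pass_at_1(results)
print(f"{score:.2%}")  # 97.73%
```

With 33 problems and 4 samples each there are 132 graded answers, so under this assumption every score is a multiple of 1/132 (about 0.76 percentage points). The scores listed on this page are consistent with that granularity; for example, 97.73% corresponds to 129 of 132 correct.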

Category: Reasoning. Metric: accuracy (higher is better).
| Model | Lab | Score | Status | Date |
| --- | --- | --- | --- | --- |
| GPT-5.4 | OpenAI | 97.73% | Official | Mar 5, 2026 |
| GPT-5.5 | OpenAI | 97.73% | Official | Apr 23, 2026 |
| GPT-5.2 | OpenAI | 96.97% | Official | Dec 11, 2025 |
| Claude Opus 4.6 | Anthropic | 96.21% | Official | Feb 5, 2026 |
| Gemini 3.5 Flash | Google DeepMind | 95.45% | Official | May 19, 2026 |
| Claude Opus 4.8 | Anthropic | 95.45% | Official | May 28, 2026 |
| Kimi K2.6 | Moonshot AI | 94.70% | Official | Apr 20, 2026 |
| Gemini 3.1 Pro Preview | Google DeepMind | 94.70% | Official | Feb 19, 2026 |
| DeepSeek V4 Flash | DeepSeek | 93.94% | Official | Apr 24, 2026 |
| DeepSeek V4 Pro | DeepSeek | 93.94% | Official | Apr 24, 2026 |
| Claude Opus 4.7 | Anthropic | 93.94% | Official | Apr 16, 2026 |
| Gemini 3 Flash | Google DeepMind | 89.39% | Official | Dec 17, 2025 |
| GLM-5.1 | Z.ai | 89.39% | Official | Apr 7, 2026 |
| Qwen3.5-397B-A17B | Alibaba / Qwen | 87.88% | Official | Feb 16, 2026 |
| Kimi K2.5 | Moonshot AI | 87.12% | Official | Jan 27, 2026 |
| Grok 4.1 fast reasoning | xAI | 86.36% | Official | Nov 19, 2025 |
| GLM-5 | Z.ai | 86.36% | Official | Feb 11, 2026 |
| Gemini 3 Pro | Google DeepMind | 86.36% | Official | Nov 18, 2025 |
| NVIDIA Nemotron 3 Super 120B-A12B | NVIDIA | 84.85% | Official | Mar 10, 2026 |
| DeepSeek V3.2 | DeepSeek | 84.09% | Official | Dec 1, 2025 |

Each row reports the model’s accuracy on MathArena HMMT February 2026.