evals.report

MathArena HMMT February 2026

Contamination-free evaluation of large language models on the 33 problems of the HMMT February 2026 mathematics competition. Models are scored on final-answer accuracy, with pass@1 estimated from 4 samples per problem. The problems were released after the evaluated models were trained, which is what makes the evaluation contamination-free.
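As a rough illustration of how a pass@1 estimate from 4 samples per problem can be computed, the sketch below averages each problem's fraction of correct samples. The function name `estimate_pass_at_1` and the per-problem counts are hypothetical, and the page does not state the exact aggregation MathArena uses, so treat this as an assumption rather than the benchmark's actual scoring code.

```python
from statistics import mean


def estimate_pass_at_1(correct_counts, samples_per_problem=4):
    """Estimate pass@1 as the mean per-problem fraction of correct samples.

    correct_counts: one integer per problem, the number of sampled answers
    whose final answer matched the reference (hypothetical input format).
    """
    if not correct_counts:
        raise ValueError("need at least one problem")
    for count in correct_counts:
        if not 0 <= count <= samples_per_problem:
            raise ValueError(f"count {count} outside 0..{samples_per_problem}")
    # With k = 1, the per-problem estimate is simply correct / samples.
    return mean(count / samples_per_problem for count in correct_counts)


# Made-up counts for 33 problems: 20 always solved, 8 solved in
# 2 of 4 samples, 5 never solved. These are NOT real results.
example_counts = [4] * 20 + [2] * 8 + [0] * 5
example_score = estimate_pass_at_1(example_counts)
print(f"pass@1 estimate: {example_score:.3f}")
```

Averaging over samples rather than taking a single attempt reduces the variance of the reported accuracy, which matters on a set of only 33 problems.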

Reasoning | accuracy | Higher is better

No run guide for this benchmark yet.