evals.report

FrontierMath Tier 4

FrontierMath Tier 4 is Epoch AI's expansion set of 50 exceptionally difficult, original, research-level mathematics problems, crafted and vetted by expert mathematicians, each of which can take a specialist days to solve. It measures an AI model's advanced mathematical reasoning by exact-answer accuracy.
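To make "exact-answer accuracy" concrete, here is a minimal sketch of how such a metric could be computed. It is not Epoch AI's grading code: the helper names `is_exact_match` and `accuracy` are hypothetical, and the rule used here (compare answers as rational numbers, falling back to string equality) is an assumption chosen only for illustration.

```python
from fractions import Fraction


def is_exact_match(predicted: str, reference: str) -> bool:
    """Hypothetical checker: answers match if they denote the same
    rational number, or otherwise if the stripped strings are identical."""
    try:
        return Fraction(predicted.strip()) == Fraction(reference.strip())
    except (ValueError, ZeroDivisionError):
        # Not parseable as a rational number: fall back to string equality.
        return predicted.strip() == reference.strip()


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of problems whose final answer matches exactly (no partial credit)."""
    correct = sum(is_exact_match(p, r) for p, r in zip(predictions, references))
    return correct / len(references)
```

Under this kind of scoring a problem is either fully right or fully wrong, so a model's score is always a whole number of solved problems divided by the number of graded problems.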

Category: Reasoning · Metric: accuracy · Higher is better
| Model | Lab | Score | Source model | Status | Date |
| --- | --- | --- | --- | --- | --- |
| GPT-5.5 Pro | OpenAI | 39.6% | | Official | Apr 23, 2026 |
| GPT-5.4 Pro | OpenAI | 37.5% | | Official | Mar 5, 2026 |
| GPT-5.5 | OpenAI | 35.4% | | Official | Apr 23, 2026 |
| GPT-5.4 | OpenAI | 27.1% | | Official | Mar 5, 2026 |
| Claude Opus 4.7 | Anthropic | 22.9% | | Official | Apr 16, 2026 |
| Claude Opus 4.6 | Anthropic | 22.9% | | Official | Feb 5, 2026 |
| GPT-5.2 | OpenAI | 18.8% | | Official | Dec 11, 2025 |
| Gemini 3 Pro | Google DeepMind | 18.8% | | Official | Nov 18, 2025 |
| Gemini 3.1 Pro Preview | Google DeepMind | 16.7% | | Official | Feb 19, 2026 |
| Muse Spark | Meta | 14.6% | | Official | Apr 8, 2026 |
| Gemini 3.5 Flash | Google DeepMind | 14.6% | | Official | May 19, 2026 |
| Kimi K2.6 | Moonshot AI | 14.6% | | Official | Apr 20, 2026 |
| GLM-5.1 | Z.ai | 12.5% | | Official | Apr 7, 2026 |
| GPT-5.1 | OpenAI | 12.5% | | Official | Nov 12, 2025 |
| GPT-5 | OpenAI | 12.5% | | Official | Aug 7, 2025 |
| Qwen 3.6 Plus | Alibaba / Qwen | 8.3% | | Official | Apr 2, 2026 |
| Claude Sonnet 4.6 | Anthropic | 8.3% | | Official | Feb 17, 2026 |
| GPT-5 mini | OpenAI | 6.3% | | Official | Aug 7, 2025 |
| o4-mini | OpenAI | 6.3% | | Official | Apr 16, 2025 |
| Kimi K2.5 | Moonshot AI | 4.2% | | Official | Jan 27, 2026 |
| Qwen 3.6 Max Preview | Alibaba / Qwen | 4.2% | | Official | Apr 20, 2026 |
| Gemini 2.5 Flash | Google DeepMind | 4.2% | | Official | Apr 17, 2025 |
| Gemini 3 Flash | Google DeepMind | 4.2% | | Official | Dec 17, 2025 |
| Claude Opus 4.5 | Anthropic | 4.2% | | Official | Nov 24, 2025 |
| Claude Sonnet 4.5 | Anthropic | 4.2% | | Official | Sep 29, 2025 |
| Claude Opus 4.1 | Anthropic | 4.2% | | Official | Aug 5, 2025 |
| Gemini 2.5 Pro | Google DeepMind | 4.2% | | Official | Mar 25, 2025 |
| Claude Opus 4 | Anthropic | 4.2% | | Official | May 22, 2025 |
| GLM-4.6 | Z.ai | 2.1% | | Official | Sep 30, 2025 |
| GLM-5 | Z.ai | 2.1% | | Official | Feb 11, 2026 |
| DeepSeek V3.2 | DeepSeek | 2.1% | | Official | Dec 1, 2025 |
| Qwen3.5-397B-A17B | Alibaba / Qwen | 2.1% | | Official | Feb 16, 2026 |
| Claude Haiku 4.5 | Anthropic | 2.1% | | Official | Oct 15, 2025 |
| Grok 4 | xAI | 2.1% | | Official | Jul 9, 2025 |
| o3 | OpenAI | 2.1% | | Official | Apr 16, 2025 |

Each row reports the model's accuracy on FrontierMath Tier 4.
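Because accuracy is a count of solved problems divided by the number of graded problems, the rounded percentages above constrain how many problems were actually scored. The sketch below searches for problem counts that are consistent with every distinct score on this page. The function name `consistent_denominators` and the assumption that scores are rounded to one decimal place are illustrative and are not taken from the page.

```python
def consistent_denominators(percentages, max_problems, tolerance=0.05):
    """Return every problem count n <= max_problems for which each
    percentage is within rounding distance of some whole-number k / n."""
    matches = []
    for n in range(1, max_problems + 1):
        ok = True
        for pct in percentages:
            k = round(pct * n / 100)  # nearest whole number of solved problems
            # Small epsilon absorbs floating-point error at the rounding boundary.
            if abs(100 * k / n - pct) > tolerance + 1e-9:
                ok = False
                break
        if ok:
            matches.append(n)
    return matches


# Distinct scores that appear in the leaderboard above.
scores = [39.6, 37.5, 35.4, 27.1, 22.9, 18.8, 16.7,
          14.6, 12.5, 8.3, 6.3, 4.2, 2.1]
print(consistent_denominators(scores, max_problems=50))
```

On these scores the only count up to 50 that fits is 48, which would make each problem worth about 2.08 percentage points (for example, 39.6% corresponds to 19 of 48). This is an inference from the rounding pattern, not something the page states; it suggests that two of the 50 problems may be excluded from scoring.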