evals.report

FrontierMath

A frontier math benchmark with constrained public access and source-linked result claims.

Category: Reasoning. Metric: accuracy (higher is better).
| Model | Lab | Score | Source model | Status | Date |
|---|---|---|---|---|---|
| GPT-5.5 Pro | OpenAI | 52.4% | GPT-5.5 Pro | Official | May 30, 2026 |
| GPT-5.5 | OpenAI | 51.7% | GPT-5.5 | Official | May 30, 2026 |
| GPT-5.4 Pro | OpenAI | 50.0% | GPT-5.4 Pro | Official | May 30, 2026 |
| GPT-5.4 | OpenAI | 47.6% | GPT-5.4 | Official | May 30, 2026 |
| Claude Opus 4.7 | Anthropic | 43.79% | Claude Opus 4.7 | Official | May 30, 2026 |
| Claude Opus 4.6 | Anthropic | 40.7% | Claude Opus 4.6 | Official | May 30, 2026 |
| GPT-5.2 | OpenAI | 40.7% | GPT-5.2 | Official | May 30, 2026 |
| Gemini 3.5 Flash | Google DeepMind | 38.97% | Gemini 3.5 Flash | Official | May 30, 2026 |
| Kimi K2.6 | Moonshot AI | 38.97% | Kimi K2.6 | Official | May 30, 2026 |
| Gemini 3 Pro | Google DeepMind | 37.6% | Gemini 3 Pro | Official | May 30, 2026 |
| Gemini 3.1 Pro Preview | Google DeepMind | 36.9% | Gemini 3.1 Pro | Official | May 30, 2026 |
| Gemini 3 Flash | Google DeepMind | 35.64% | Gemini 3 Flash | Official | May 30, 2026 |
| GLM-5.1 | Z.ai | 33.45% | GLM-5.1 | Official | May 30, 2026 |
| GPT-5 | OpenAI | 32.41% | GPT-5 | Official | May 30, 2026 |
| Claude Sonnet 4.6 | Anthropic | 32.4% | Claude Sonnet 4.6 | Official | May 30, 2026 |
| GPT-5.1 | OpenAI | 31.03% | GPT-5.1 | Official | May 30, 2026 |
| Kimi K2.5 | Moonshot AI | 27.9% | Kimi K2.5 | Official | May 30, 2026 |
| o4-mini | OpenAI | 24.83% | o4-mini | Official | May 30, 2026 |
| DeepSeek V3.2 | DeepSeek | 22.1% | DeepSeek-V3.2 | Official | May 30, 2026 |
| Claude Opus 4.5 | Anthropic | 20.69% | Claude Opus 4.5 | Official | May 30, 2026 |
| Grok 4 | xAI | 19.66% | Grok 4 | Official | May 30, 2026 |
| o3 | OpenAI | 18.69% | o3 | Official | May 30, 2026 |

Each row reports the model’s accuracy on FrontierMath.