evals.report

PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts

A multilingual mathematical reasoning benchmark of 9,000 parallel problems across 18 languages and 4 difficulty levels (K-12 to Olympiad/frontier), scored by difficulty-weighted accuracy.

Category: Reasoning. Metric: Difficulty-Weighted Accuracy (DW-ACC). Higher is better.

| Model | Lab | DW-ACC (%) | Source model | Status | Date |
|---|---|---|---|---|---|
| Kimi K2 Instruct | Moonshot AI | 65.1 | | Unverified | Jul 11, 2025 |
| Gemini 2.5 Pro | Google DeepMind | 52.2 | | Verified | Mar 25, 2025 |
| DeepSeek R1 | DeepSeek | 47.0 | | Verified | Jan 20, 2025 |
| o4-mini | OpenAI | 45.6 | | Verified | Apr 16, 2025 |
| Claude 3.7 Sonnet | Anthropic | 33.5 | | Verified | Feb 24, 2025 |
| DeepSeek V3 0324 | DeepSeek | 30.7 | | Verified | Mar 24, 2025 |
| GPT-4.1 | OpenAI | 26.4 | | Verified | Apr 14, 2025 |
| Llama 4 Maverick | Meta | 26.1 | | Verified | Apr 5, 2025 |
| Llama 4 Scout | Meta | 20.9 | | Verified | Apr 5, 2025 |
| DeepSeek V3 | DeepSeek | 20.4 | | Verified | Dec 26, 2024 |
| GPT-4o | OpenAI | 13.7 | | Verified | May 13, 2024 |

Each row reports the model’s Difficulty-Weighted Accuracy (DW-ACC) on PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts.
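
To make the metric concrete, here is a minimal sketch of how a difficulty-weighted accuracy could be computed from per-level accuracies. This page does not state the weights, so the doubling scheme below (1, 2, 4, 8 across the four levels) is an assumption, as are the level names, the function name `dw_acc`, and the example accuracies. None of these values come from the leaderboard.

```python
from typing import Mapping

# Assumed weights: each harder level counts twice as much as the previous one.
# The level names and the weights are illustrative, not taken from this page.
LEVEL_WEIGHTS: dict[str, int] = {"low": 1, "medium": 2, "high": 4, "top": 8}


def dw_acc(
    acc_by_level: Mapping[str, float],
    weights: Mapping[str, int] = LEVEL_WEIGHTS,
) -> float:
    """Return the weighted mean of per-level accuracies (each in [0, 1])."""
    if set(acc_by_level) != set(weights):
        raise ValueError("accuracy levels must match the weight levels")
    total_weight = sum(weights.values())
    weighted_sum = sum(weights[level] * acc_by_level[level] for level in weights)
    return weighted_sum / total_weight


# Made-up per-level accuracies for a hypothetical model.
example = {"low": 0.9, "medium": 0.6, "high": 0.3, "top": 0.1}
score = dw_acc(example)
plain_mean = sum(example.values()) / len(example)
print(f"DW-ACC: {score:.3f}, unweighted mean: {plain_mean:.3f}")
```

Under these assumed weights, a model that does well only on the easier levels scores noticeably lower than its unweighted mean, which is the point of weighting by difficulty.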