PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts
A multilingual mathematical reasoning benchmark of 9,000 parallel problems across 18 languages and 4 difficulty levels (K-12 to Olympiad/frontier), scored by difficulty-weighted accuracy.
What this benchmark measures
PolyMath measures mathematical reasoning across 18 languages. Its 9,000 problems are parallel, so the same problems appear in every language, and they span 4 difficulty levels from K-12 to Olympiad/frontier. Models are scored by difficulty-weighted accuracy.
Rows on this page are sourced from public benchmark artifacts, leaderboard exports, or source-linked model reports. Each row keeps benchmark version, source model name, and available run details attached to the score.
The metric shown here is Difficulty-Weighted Accuracy (DW-ACC). Interpret it only within PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts. It is not part of a site-wide ranking and should not be compared with scores from other benchmarks.
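To make the metric concrete, here is a minimal sketch of how a difficulty-weighted accuracy could be computed from per-level accuracies. This page does not state the weights, so the values below (1, 2, 4, 8, doubling with each harder level) are an assumption for illustration, and the names `dw_acc` and `ASSUMED_LEVEL_WEIGHTS` are hypothetical. Check the benchmark's own documentation for the official definition before reproducing scores.

```python
from typing import Sequence

# ASSUMPTION: weights double with each harder level (1, 2, 4, 8).
# These values are not given on this page; they are illustrative only.
ASSUMED_LEVEL_WEIGHTS: tuple[float, ...] = (1.0, 2.0, 4.0, 8.0)


def dw_acc(
    level_accuracies: Sequence[float],
    weights: Sequence[float] = ASSUMED_LEVEL_WEIGHTS,
) -> float:
    """Weighted mean of per-level accuracies, easiest level first.

    Each accuracy is a fraction in [0, 1]. The result is normalized by
    the sum of the weights, so it also lies in [0, 1].
    """
    if len(level_accuracies) != len(weights):
        raise ValueError("need exactly one accuracy per difficulty level")
    if any(not 0.0 <= a <= 1.0 for a in level_accuracies):
        raise ValueError("accuracies must be fractions between 0 and 1")
    total_weight = sum(weights)
    weighted_sum = sum(w * a for w, a in zip(weights, level_accuracies))
    return weighted_sum / total_weight


# Hypothetical per-level accuracies, easiest to hardest (not real results).
example = [0.9, 0.7, 0.4, 0.1]
print(f"DW-ACC: {dw_acc(example):.3f}")
print(f"Plain mean: {sum(example) / len(example):.3f}")
```

Under these assumed weights, the hypothetical model above scores about 0.313, well below its plain mean of 0.525, because the hardest level counts eight times as much as the easiest. This is also why the score is only meaningful inside the benchmark that defines the weighting.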