evals.report

PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts

A multilingual mathematical reasoning benchmark of 9,000 parallel problems across 18 languages and 4 difficulty levels (K-12 to Olympiad/frontier), scored by difficulty-weighted accuracy.

Reasoning · Difficulty-Weighted Accuracy (DW-ACC) · Higher is better

What this benchmark measures

Rows on this page are sourced from public benchmark artifacts, leaderboard exports, or source-linked model reports. Each row keeps benchmark version, source model name, and available run details attached to the score.

The metric shown here is Difficulty-Weighted Accuracy (DW-ACC). Compare DW-ACC scores only against other results on PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts; they are not part of any site-wide ranking.
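To make the metric concrete, here is a minimal sketch of how a difficulty-weighted accuracy can be computed from per-level accuracies. This page does not state the weights, so the values below are an assumption: weights that double at each of the four ascending difficulty levels (1, 2, 4, 8), which is how the PolyMath authors are understood to define DW-ACC. The function name `dw_acc` and the sample accuracies are hypothetical and chosen only for illustration.

```python
from typing import Sequence

# ASSUMPTION: weights double at each ascending difficulty level
# (lowest to highest). This page does not list the weights itself.
ASSUMED_LEVEL_WEIGHTS = (1, 2, 4, 8)


def dw_acc(
    level_accuracies: Sequence[float],
    weights: Sequence[float] = ASSUMED_LEVEL_WEIGHTS,
) -> float:
    """Weighted mean of per-level accuracies, normalized by the weight sum.

    `level_accuracies` holds one accuracy in [0, 1] per difficulty level,
    ordered from easiest to hardest.
    """
    if len(level_accuracies) != len(weights):
        raise ValueError("need exactly one accuracy per difficulty level")
    weighted_sum = sum(w * a for w, a in zip(weights, level_accuracies))
    return weighted_sum / sum(weights)


# Hypothetical per-level accuracies, easiest level first (not real results).
sample = [0.9, 0.6, 0.3, 0.1]
score = dw_acc(sample)
plain_mean = sum(sample) / len(sample)
```

With these made-up inputs the weighted score is about 0.273, while the unweighted mean is 0.475. Under the assumed weights, accuracy on the hardest level dominates the result, which is one reason a DW-ACC value should not be read like a plain accuracy.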

No composite ranking
evals.report never combines benchmarks. Difficulty-Weighted Accuracy (DW-ACC) on PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts is its own number; don’t average it with metrics from other benchmarks.