evals.report

PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts

PolyMath is a multilingual mathematical reasoning benchmark of 9,000 parallel problems spanning 18 languages and 4 difficulty levels (from K-12 to Olympiad and frontier mathematics). Models are scored by difficulty-weighted accuracy (DW-ACC).

Reasoning | Difficulty-Weighted Accuracy (DW-ACC) | Higher is better
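
To make the metric concrete, here is a minimal sketch of how a difficulty-weighted accuracy could be computed from per-level accuracies. This page does not state the weights, so the doubling scheme below (1, 2, 4, 8 from the easiest to the hardest level) is an assumption, and the names `ASSUMED_LEVEL_WEIGHTS` and `difficulty_weighted_accuracy` are hypothetical. Check the benchmark's own documentation before relying on the exact values.

```python
from typing import Sequence

# ASSUMPTION: weights double with each difficulty level (easiest -> hardest).
# The source page only says accuracy is "difficulty-weighted".
ASSUMED_LEVEL_WEIGHTS = (1.0, 2.0, 4.0, 8.0)


def difficulty_weighted_accuracy(
    level_accuracies: Sequence[float],
    weights: Sequence[float] = ASSUMED_LEVEL_WEIGHTS,
) -> float:
    """Return the weighted mean of per-level accuracies, in [0, 1].

    level_accuracies: accuracy at each difficulty level, ordered from
    easiest to hardest, each a fraction between 0 and 1.
    """
    if len(level_accuracies) != len(weights):
        raise ValueError("need exactly one accuracy per difficulty level")
    for acc in level_accuracies:
        if not 0.0 <= acc <= 1.0:
            raise ValueError("accuracies must be fractions in [0, 1]")
    # Normalize by the total weight so a perfect model scores 1.0.
    return sum(w * a for w, a in zip(weights, level_accuracies)) / sum(weights)


if __name__ == "__main__":
    # Hypothetical model: strong on easy problems, weak on the hardest ones.
    score = difficulty_weighted_accuracy([0.9, 0.7, 0.4, 0.1])
    print(f"DW-ACC: {score:.3f}")
```

Under these assumed weights, solving only the hardest level contributes eight times as much as solving only the easiest level, so the score rewards progress on hard problems more than plain accuracy would.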
