PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts
A multilingual mathematical reasoning benchmark of 9,000 parallel problems across 18 languages and 4 difficulty levels (K-12 to Olympiad/frontier), scored by difficulty-weighted accuracy.
Reasoning | Difficulty-Weighted Accuracy (DW-ACC) | Higher is better
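As a rough illustration of how a difficulty-weighted accuracy can be computed, the sketch below takes one accuracy per difficulty level and returns their weighted mean. The weights (1, 2, 4, 8, doubling with each level), the level names, and the function name `dw_acc` are assumptions made for this example; this summary does not state them, so check the benchmark's own definition before relying on the numbers.

```python
from typing import Mapping

# ASSUMPTION: weights double with each difficulty level. These values and the
# level names are illustrative; they are not given in this benchmark summary.
LEVEL_WEIGHTS = {"low": 1, "medium": 2, "high": 4, "top": 8}


def dw_acc(
    acc_by_level: Mapping[str, float],
    weights: Mapping[str, float] = LEVEL_WEIGHTS,
) -> float:
    """Return the weighted mean of per-level accuracies (each in [0, 1])."""
    missing = set(weights) - set(acc_by_level)
    if missing:
        raise ValueError(f"missing accuracy for levels: {sorted(missing)}")
    for level in weights:
        if not 0.0 <= acc_by_level[level] <= 1.0:
            raise ValueError(f"accuracy for {level!r} must be in [0, 1]")
    total_weight = sum(weights.values())
    weighted_sum = sum(weights[level] * acc_by_level[level] for level in weights)
    return weighted_sum / total_weight


if __name__ == "__main__":
    # Hypothetical per-level accuracies for a single language.
    scores = {"low": 0.9, "medium": 0.8, "high": 0.5, "top": 0.2}
    print(f"DW-ACC: {dw_acc(scores):.3f}")
```

Under these assumed weights, the hardest level counts eight times as much as the easiest, so a model that only solves easy problems scores low: perfect accuracy on the lowest level alone yields 1/15, while the hypothetical scores above give about 0.407.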
No run guide for this benchmark yet.