Question 1

What is PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts?

Accepted Answer

PolyMath is a multilingual mathematical reasoning benchmark of 9,000 parallel problems across 18 languages and 4 difficulty levels (K-12 to Olympiad/frontier). Results are reported as Difficulty-Weighted Accuracy (DW-ACC).

Question 2

What does Difficulty-Weighted Accuracy (DW-ACC) mean on PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts?

Accepted Answer

DW-ACC is an accuracy score that gives harder problems more weight, so correct answers at the higher difficulty levels count for more than correct answers at the lower ones. PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts reports DW-ACC as a percentage; higher is better. Scores are shown only within this benchmark and are never averaged with other benchmarks.
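To make the weighting concrete, here is a minimal sketch of how a difficulty-weighted accuracy could be computed from per-level accuracies. The level names (low, medium, high, top) and the doubling weights 1, 2, 4, 8 are assumptions based on the PolyMath paper's description of the metric, not something stated on this page, and `dw_acc` is a hypothetical helper rather than part of any official harness.

```python
# Assumed weights: each difficulty level counts twice as much as the one below it.
# These values follow the PolyMath paper's description and are not confirmed here.
LEVEL_WEIGHTS = {"low": 1, "medium": 2, "high": 4, "top": 8}


def dw_acc(accuracy_by_level):
    """Return difficulty-weighted accuracy in percent.

    accuracy_by_level maps each level name to a plain accuracy in [0, 1].
    """
    missing = set(LEVEL_WEIGHTS) - set(accuracy_by_level)
    if missing:
        raise ValueError(f"missing accuracy for levels: {sorted(missing)}")

    total_weight = sum(LEVEL_WEIGHTS.values())  # 1 + 2 + 4 + 8 = 15
    weighted_sum = sum(
        weight * accuracy_by_level[level]
        for level, weight in LEVEL_WEIGHTS.items()
    )
    return 100.0 * weighted_sum / total_weight


# Illustrative numbers only, not results for any real model.
example = {"low": 0.9, "medium": 0.6, "high": 0.3, "top": 0.15}
print(round(dw_acc(example), 1))
```

Under these assumed weights, the illustrative model above would have an unweighted mean accuracy near 49% but a much lower weighted score, because its weakest results fall on the most heavily weighted level.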

Question 3

What is the top reported PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts score?

Accepted Answer

The top reported score on PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts is 65.1% DW-ACC, achieved by Kimi K2 Instruct.

Question 4

Why do PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts scores differ across runs?

Accepted Answer

Results depend on the harness, scaffold, reasoning effort, and prompt setup, so two runs of the same model can produce different scores. evals.report keeps each score with its run context so the differences stay visible.

Question 5

Does evals.report rank models across benchmarks?

Accepted Answer

No. PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts scores are shown within their own metric; evals.report never combines benchmarks into a composite ranking or a single "best model".
