PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts
A multilingual mathematical reasoning benchmark of 9,000 parallel problems across 18 languages and 4 difficulty levels (K-12 to Olympiad/frontier), scored by difficulty-weighted accuracy.
Reasoning · Difficulty-Weighted Accuracy (DW-ACC) · Higher is better
| Model | Lab | DW-ACC (%) | Source model | Status | Date |
|---|---|---|---|---|---|
| Kimi K2 Instruct | Moonshot AI | 65.1 | — | Unverified | Jul 11, 2025 |
| Gemini 2.5 Pro | Google DeepMind | 52.2 | — | Verified | Mar 25, 2025 |
| DeepSeek R1 | DeepSeek | 47.0 | — | Verified | Jan 20, 2025 |
| o4-mini | OpenAI | 45.6 | — | Verified | Apr 16, 2025 |
| Claude 3.7 Sonnet | Anthropic | 33.5 | — | Verified | Feb 24, 2025 |
| DeepSeek V3 0324 | DeepSeek | 30.7 | — | Verified | Mar 24, 2025 |
| GPT-4.1 | OpenAI | 26.4 | — | Verified | Apr 14, 2025 |
| Llama 4 Maverick | Meta | 26.1 | — | Verified | Apr 5, 2025 |
| Llama 4 Scout | Meta | 20.9 | — | Verified | Apr 5, 2025 |
| DeepSeek V3 | DeepSeek | 20.4 | — | Verified | Dec 26, 2024 |
| GPT-4o | OpenAI | 13.7 | — | Verified | May 13, 2024 |
Each row reports the model’s Difficulty-Weighted Accuracy (DW-ACC) on PolyMath, listed from highest score to lowest.
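To make the metric concrete, here is a minimal sketch of how a difficulty-weighted accuracy could be computed from per-level accuracies. It assumes the four difficulty levels are weighted 1, 2, 4, and 8 (doubling with each harder level) and normalized by the weight sum. This page does not state the weights, so treat `ASSUMED_WEIGHTS` and the function name `dw_acc` as illustrative assumptions. The sample accuracies are made up and do not correspond to any model in the table.

```python
from typing import Sequence

# Assumption: weights double at each harder level (easiest -> hardest).
# The leaderboard page itself does not list these values.
ASSUMED_WEIGHTS = (1, 2, 4, 8)


def dw_acc(level_accuracies: Sequence[float],
           weights: Sequence[float] = ASSUMED_WEIGHTS) -> float:
    """Weighted mean of per-level accuracies, normalized by the weight sum.

    level_accuracies: accuracy at each difficulty level, easiest first,
    each expressed as a fraction in [0, 1].
    """
    if len(level_accuracies) != len(weights):
        raise ValueError("need exactly one accuracy per difficulty level")
    weighted_sum = sum(w * a for w, a in zip(weights, level_accuracies))
    return weighted_sum / sum(weights)


# Hypothetical per-level accuracies, for illustration only.
example = [0.9, 0.6, 0.3, 0.1]
print(f"DW-ACC: {100 * dw_acc(example):.1f}%")
```

Under these assumed weights, the hardest level contributes 8/15 of the score, so a model that only solves easy problems scores far below its plain average accuracy.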