SciCode
A scientist-curated benchmark that evaluates language models on realistic scientific research coding problems. It comprises 338 subproblems decomposed from 80 challenging main problems, spanning 16 subfields across natural-science domains including physics, math, chemistry, biology, and materials science.
Coding · accuracy · Higher is better
| Model | Lab | Score | Source model | Status | Date | |
|---|---|---|---|---|---|---|
| Gemini 3.1 Pro Preview | Google DeepMind | 58.9% | — | Unverified | Feb 19, 2026 | Details |
| GPT-5.4 | OpenAI | 56.6% | — | Unverified | Mar 5, 2026 | Details |
| Gemini 3 Pro | Google DeepMind | 56.1% | — | Unverified | Nov 18, 2025 | Details |
| GPT-5.2-Codex | OpenAI | 54.6% | — | Unverified | Dec 18, 2025 | Details |
| Claude Opus 4.7 | Anthropic | 54.5% | — | Unverified | Apr 16, 2026 | Details |
| Kimi K2.6 | Moonshot AI | 53.5% | — | Unverified | Apr 20, 2026 | Details |
| GPT-5.3-Codex | OpenAI | 53.2% | — | Unverified | Feb 5, 2026 | Details |
| Gemini 3.5 Flash | Google DeepMind | 53.1% | — | Unverified | May 19, 2026 | Details |
| Claude Opus 4.6 | Anthropic | 51.9% | — | Unverified | Feb 5, 2026 | Details |
| Gemini 3 Flash | Google DeepMind | 50.6% | — | Unverified | Dec 17, 2025 | Details |
| MiMo-V2.5-Pro | Xiaomi | 50.2% | — | Unverified | Apr 22, 2026 | Details |
| DeepSeek V4 Pro | DeepSeek | 50.0% | — | Unverified | Apr 24, 2026 | Details |
| Claude Opus 4.5 | Anthropic | 49.5% | — | Unverified | Nov 24, 2025 | Details |
| Kimi K2.5 | Moonshot AI | 49.0% | — | Unverified | Jan 27, 2026 | Details |
| Qwen3.7 Max Preview | Alibaba / Qwen | 48.8% | — | Unverified | May 14, 2026 | Details |
| MiniMax M2.7 | MiniMax | 47.0% | — | Unverified | Mar 18, 2026 | Details |
| Claude Sonnet 4.6 | Anthropic | 46.9% | — | Unverified | Feb 17, 2026 | Details |
| o4-mini | OpenAI | 46.5% | — | Unverified | Apr 16, 2025 | Details |
| GPT-5.2 | OpenAI | 46.2% | — | Unverified | Dec 11, 2025 | Details |
| GLM-5 | Z.ai | 46.2% | — | Unverified | Feb 11, 2026 | Details |
| Grok 4 | xAI | 45.7% | — | Unverified | Jul 9, 2025 | Details |
| GLM-4.7 | Z.ai | 45.1% | — | Unverified | Dec 22, 2025 | Details |
| DeepSeek V4 Flash | DeepSeek | 44.9% | — | Unverified | Apr 24, 2026 | Details |
| Claude Sonnet 4.5 | Anthropic | 44.7% | — | Unverified | Sep 29, 2025 | Details |
| Grok 4.1 fast reasoning | xAI | 44.2% | — | Unverified | Nov 19, 2025 | Details |
| GLM-5.1 | Z.ai | 43.8% | — | Unverified | Apr 7, 2026 | Details |
| Claude Haiku 4.5 | Anthropic | 43.3% | — | Unverified | Oct 15, 2025 | Details |
| GPT-5.1 | OpenAI | 43.3% | — | Unverified | Nov 12, 2025 | Details |
| Qwen3 Max | Alibaba / Qwen | 43.1% | — | Unverified | Sep 5, 2025 | Details |
| GPT-5 | OpenAI | 42.9% | — | Unverified | Aug 7, 2025 | Details |
| Gemini 2.5 Pro | Google DeepMind | 42.8% | — | Unverified | Mar 25, 2025 | Details |
| MiniMax M2.5 | MiniMax | 42.6% | — | Unverified | Feb 12, 2026 | Details |
| Kimi K2 Instruct | Moonshot AI | 42.4% | — | Unverified | Jul 11, 2025 | Details |
| Qwen3.5-397B-A17B | Alibaba / Qwen | 42.0% | — | Unverified | Feb 16, 2026 | Details |
| o3 | OpenAI | 41.0% | — | Unverified | Apr 16, 2025 | Details |
| GPT-5 mini | OpenAI | 41.0% | — | Unverified | Aug 7, 2025 | Details |
| Claude Opus 4 | Anthropic | 40.9% | — | Unverified | May 22, 2025 | Details |
| Claude Opus 4.1 | Anthropic | 40.9% | — | Unverified | Aug 5, 2025 | Details |
| MiniMax M2.1 | MiniMax | 40.7% | — | Unverified | Dec 23, 2025 | Details |
| Claude 3.7 Sonnet | Anthropic | 40.3% | — | Unverified | Feb 24, 2025 | Details |
| Claude Sonnet 4 | Anthropic | 40.0% | — | Unverified | May 22, 2025 | Details |
| GPT-OSS-120B | OpenAI | 38.9% | — | Unverified | Aug 5, 2025 | Details |
| DeepSeek V3.2 | DeepSeek | 38.7% | — | Unverified | Dec 1, 2025 | Details |
| GLM-4.6 | Z.ai | 38.4% | — | Unverified | Sep 30, 2025 | Details |
| GPT-4.1 | OpenAI | 38.1% | — | Unverified | Apr 14, 2025 | Details |
| DeepSeek V3.1 | DeepSeek | 36.7% | — | Unverified | Aug 21, 2025 | Details |
| Claude 3.5 Sonnet | Anthropic | 36.6% | — | Unverified | Jun 20, 2024 | Details |
| Qwen3 235B A22B Instruct 2507 | Alibaba / Qwen | 36.0% | — | Unverified | Jul 21, 2025 | Details |
| Qwen 3 Coder 480B | Alibaba / Qwen | 35.9% | — | Unverified | Jul 22, 2025 | Details |
| DeepSeek V3 0324 | DeepSeek | 35.8% | — | Unverified | Mar 24, 2025 | Details |
| DeepSeek R1 | DeepSeek | 35.7% | — | Unverified | Jan 20, 2025 | Details |
| Gemini 2.0 Flash | Google DeepMind | 33.3% | — | Unverified | Dec 11, 2024 | Details |
| Llama 4 Maverick | Meta | 33.1% | — | Unverified | Apr 5, 2025 | Details |
| Llama 3.1 405B | Meta | 29.9% | — | Unverified | Jul 23, 2024 | Details |
| Gemini 2.5 Flash | Google DeepMind | 29.1% | — | Unverified | Apr 17, 2025 | Details |
| Mistral Large | Mistral AI | 20.8% | — | Unverified | Feb 26, 2024 | Details |
| Llama 4 Scout | Meta | 17.0% | — | Unverified | Apr 5, 2025 | Details |
| GPT-4o | OpenAI | 1.5% | — | Official | May 13, 2024 | Details |
| Gemini 1.5 Pro | Google DeepMind | 1.5% | — | Official | Feb 15, 2024 | Details |
Each row reports the model’s accuracy on SciCode.
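For readers who want to work with these numbers programmatically, here is a minimal sketch of how rows in the shape used by the table could be parsed and summarized. The helper names `parse_rows` and `best_per_lab` are hypothetical and are not part of any SciCode tooling. Only a handful of rows are copied in for brevity, and the sketch assumes every row keeps the same pipe-delimited column order.

```python
# A few rows copied verbatim from the table; the rest follow the same shape.
ROWS = """\
| Gemini 3.1 Pro Preview | Google DeepMind | 58.9% | — | Unverified | Feb 19, 2026 | Details |
| GPT-5.4 | OpenAI | 56.6% | — | Unverified | Mar 5, 2026 | Details |
| Claude Opus 4.7 | Anthropic | 54.5% | — | Unverified | Apr 16, 2026 | Details |
| GPT-4o | OpenAI | 1.5% | — | Official | May 13, 2024 | Details |
"""


def parse_rows(text):
    """Turn pipe-delimited table rows into dicts (hypothetical helper)."""
    records = []
    for line in text.splitlines():
        # Drop the outer pipes, then split the remaining cells.
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) < 6:
            continue  # skip blank or malformed lines
        model, lab, score, _source, status, date = cells[:6]
        records.append({
            "model": model,
            "lab": lab,
            "score": float(score.rstrip("%")),  # "58.9%" -> 58.9
            "status": status,
            "date": date,
        })
    return records


def best_per_lab(records):
    """Keep the highest-scoring record for each lab (hypothetical helper)."""
    best = {}
    for r in records:
        current = best.get(r["lab"])
        if current is None or r["score"] > current["score"]:
            best[r["lab"]] = r
    return best


records = parse_rows(ROWS)
top = best_per_lab(records)
official = [r["model"] for r in records if r["status"] == "Official"]
```

With the four sample rows, `top["OpenAI"]["model"]` is `GPT-5.4` and `official` contains only `GPT-4o`. Filtering on the Status column this way keeps the "Official" and "Unverified" entries from being mixed in a comparison.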