evals.report

SciCode

A scientist-curated benchmark that evaluates language models on realistic scientific research coding problems. It comprises 338 subproblems decomposed from 80 challenging main problems across 16 natural-science subfields spanning physics, math, chemistry, biology, and materials science.

Coding · accuracy · Higher is better
| Model | Lab | Score | Status | Date |
| --- | --- | --- | --- | --- |
| Gemini 3.1 Pro Preview | Google DeepMind | 58.9% | Unverified | Feb 19, 2026 |
| GPT-5.4 | OpenAI | 56.6% | Unverified | Mar 5, 2026 |
| Gemini 3 Pro | Google DeepMind | 56.1% | Unverified | Nov 18, 2025 |
| GPT-5.2-Codex | OpenAI | 54.6% | Unverified | Dec 18, 2025 |
| Claude Opus 4.7 | Anthropic | 54.5% | Unverified | Apr 16, 2026 |
| Kimi K2.6 | Moonshot AI | 53.5% | Unverified | Apr 20, 2026 |
| GPT-5.3-Codex | OpenAI | 53.2% | Unverified | Feb 5, 2026 |
| Gemini 3.5 Flash | Google DeepMind | 53.1% | Unverified | May 19, 2026 |
| Claude Opus 4.6 | Anthropic | 51.9% | Unverified | Feb 5, 2026 |
| Gemini 3 Flash | Google DeepMind | 50.6% | Unverified | Dec 17, 2025 |
| MiMo-V2.5-Pro | Xiaomi | 50.2% | Unverified | Apr 22, 2026 |
| DeepSeek V4 Pro | DeepSeek | 50.0% | Unverified | Apr 24, 2026 |
| Claude Opus 4.5 | Anthropic | 49.5% | Unverified | Nov 24, 2025 |
| Kimi K2.5 | Moonshot AI | 49.0% | Unverified | Jan 27, 2026 |
| Qwen3.7 Max Preview | Alibaba / Qwen | 48.8% | Unverified | May 14, 2026 |
| MiniMax M2.7 | MiniMax | 47.0% | Unverified | Mar 18, 2026 |
| Claude Sonnet 4.6 | Anthropic | 46.9% | Unverified | Feb 17, 2026 |
| o4-mini | OpenAI | 46.5% | Unverified | Apr 16, 2025 |
| GPT-5.2 | OpenAI | 46.2% | Unverified | Dec 11, 2025 |
| GLM-5 | Z.ai | 46.2% | Unverified | Feb 11, 2026 |
| Grok 4 | xAI | 45.7% | Unverified | Jul 9, 2025 |
| GLM-4.7 | Z.ai | 45.1% | Unverified | Dec 22, 2025 |
| DeepSeek V4 Flash | DeepSeek | 44.9% | Unverified | Apr 24, 2026 |
| Claude Sonnet 4.5 | Anthropic | 44.7% | Unverified | Sep 29, 2025 |
| Grok 4.1 fast reasoning | xAI | 44.2% | Unverified | Nov 19, 2025 |
| GLM-5.1 | Z.ai | 43.8% | Unverified | Apr 7, 2026 |
| Claude Haiku 4.5 | Anthropic | 43.3% | Unverified | Oct 15, 2025 |
| GPT-5.1 | OpenAI | 43.3% | Unverified | Nov 12, 2025 |
| Qwen3 Max | Alibaba / Qwen | 43.1% | Unverified | Sep 5, 2025 |
| GPT-5 | OpenAI | 42.9% | Unverified | Aug 7, 2025 |
| Gemini 2.5 Pro | Google DeepMind | 42.8% | Unverified | Mar 25, 2025 |
| MiniMax M2.5 | MiniMax | 42.6% | Unverified | Feb 12, 2026 |
| Kimi K2 Instruct | Moonshot AI | 42.4% | Unverified | Jul 11, 2025 |
| Qwen3.5-397B-A17B | Alibaba / Qwen | 42.0% | Unverified | Feb 16, 2026 |
| o3 | OpenAI | 41.0% | Unverified | Apr 16, 2025 |
| GPT-5 mini | OpenAI | 41.0% | Unverified | Aug 7, 2025 |
| Claude Opus 4 | Anthropic | 40.9% | Unverified | May 22, 2025 |
| Claude Opus 4.1 | Anthropic | 40.9% | Unverified | Aug 5, 2025 |
| MiniMax M2.1 | MiniMax | 40.7% | Unverified | Dec 23, 2025 |
| Claude 3.7 Sonnet | Anthropic | 40.3% | Unverified | Feb 24, 2025 |
| Claude Sonnet 4 | Anthropic | 40.0% | Unverified | May 22, 2025 |
| GPT-OSS-120B | OpenAI | 38.9% | Unverified | Aug 5, 2025 |
| DeepSeek V3.2 | DeepSeek | 38.7% | Unverified | Dec 1, 2025 |
| GLM-4.6 | Z.ai | 38.4% | Unverified | Sep 30, 2025 |
| GPT-4.1 | OpenAI | 38.1% | Unverified | Apr 14, 2025 |
| DeepSeek V3.1 | DeepSeek | 36.7% | Unverified | Aug 21, 2025 |
| Claude 3.5 Sonnet | Anthropic | 36.6% | Unverified | Jun 20, 2024 |
| Qwen3 235B A22B Instruct 2507 | Alibaba / Qwen | 36.0% | Unverified | Jul 21, 2025 |
| Qwen 3 Coder 480B | Alibaba / Qwen | 35.9% | Unverified | Jul 22, 2025 |
| DeepSeek V3 0324 | DeepSeek | 35.8% | Unverified | Mar 24, 2025 |
| DeepSeek R1 | DeepSeek | 35.7% | Unverified | Jan 20, 2025 |
| Gemini 2.0 Flash | Google DeepMind | 33.3% | Unverified | Dec 11, 2024 |
| Llama 4 Maverick | Meta | 33.1% | Unverified | Apr 5, 2025 |
| Llama 3.1 405B | Meta | 29.9% | Unverified | Jul 23, 2024 |
| Gemini 2.5 Flash | Google DeepMind | 29.1% | Unverified | Apr 17, 2025 |
| Mistral Large | Mistral AI | 20.8% | Unverified | Feb 26, 2024 |
| Llama 4 Scout | Meta | 17.0% | Unverified | Apr 5, 2025 |
| GPT-4o | OpenAI | 1.5% | Official | May 13, 2024 |
| Gemini 1.5 Pro | Google DeepMind | 1.5% | Official | Feb 15, 2024 |

Each row reports the model’s accuracy on SciCode.
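Because SciCode decomposes main problems into subproblems, an accuracy figure can be counted at either level, and this page does not say which level a given row uses. The sketch below is an illustration only, not the benchmark's official scoring code: the function name `two_level_accuracy`, the tuple layout, and the toy data are all hypothetical. It assumes a main problem counts as solved only when every one of its subproblems passes.

```python
from collections import defaultdict


def two_level_accuracy(results):
    """Return (subproblem_accuracy, main_problem_accuracy).

    `results` is an iterable of (main_id, sub_id, passed) tuples.
    Assumption: a main problem is solved only if all its subproblems pass.
    """
    by_main = defaultdict(list)
    for main_id, _sub_id, passed in results:
        by_main[main_id].append(bool(passed))

    total_sub = sum(len(flags) for flags in by_main.values())
    if total_sub == 0:
        raise ValueError("no results to score")

    # Fraction of individual subproblems that passed.
    sub_acc = sum(sum(flags) for flags in by_main.values()) / total_sub
    # Fraction of main problems whose subproblems all passed.
    main_acc = sum(all(flags) for flags in by_main.values()) / len(by_main)
    return sub_acc, main_acc


# Hypothetical toy data: two main problems, five subproblems in total.
toy = [
    ("p1", "1", True),
    ("p1", "2", True),
    ("p2", "1", True),
    ("p2", "2", False),
    ("p2", "3", False),
]
sub_acc, main_acc = two_level_accuracy(toy)
print(f"subproblem accuracy: {sub_acc:.1%}, main-problem accuracy: {main_acc:.1%}")
```

On the toy data, three of five subproblems pass but only one of two main problems is fully solved, so the two figures differ. Large gaps between rows in the table may therefore reflect different counting levels as well as different model ability.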