evals.report

LiveCodeBench

A holistic, contamination-free benchmark that continuously collects new competitive-programming problems from LeetCode, AtCoder, and Codeforces (released after model training cutoffs) and measures code-generation correctness via Pass@1.
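
As a rough illustration of the metric, the sketch below computes Pass@1 with the standard unbiased pass@k estimator, averaged over problems. The function names and the per-problem sample counts are hypothetical, and this page does not say how many samples per problem each reported run used, so treat it as a sketch of the idea rather than the benchmark's actual harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed all tests, k: budget.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: list[tuple[int, int]]) -> float:
    """Average per-problem pass@1 over (n_samples, n_correct) pairs."""
    return sum(pass_at_k(n, c, 1) for n, c in results) / len(results)


# Made-up data: four problems, ten samples each (assumption for illustration).
results = [(10, 10), (10, 5), (10, 0), (10, 2)]
score = benchmark_pass_at_1(results)
print(f"Pass@1 = {score:.1%}")
```

For k = 1 the estimator reduces to the fraction of correct samples per problem, so the benchmark score is simply the mean of those fractions across problems.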

Coding · Pass@1 · Higher is better
| Model | Lab | Score | Source model | Status | Date |
| --- | --- | --- | --- | --- | --- |
| Gemini 3 Pro | Google DeepMind | 91.7% | | Unverified | Nov 18, 2025 |
| GPT-5.2 | OpenAI | 89.4% | | Unverified | Dec 11, 2025 |
| GPT-OSS-120B | OpenAI | 87.8% | | Unverified | Aug 5, 2025 |
| GPT-5.1 | OpenAI | 86.8% | | Unverified | Nov 12, 2025 |
| o4-mini | OpenAI | 85.9% | | Unverified | Apr 16, 2025 |
| Kimi K2 Instruct | Moonshot AI | 85.3% | | Unverified | Jul 11, 2025 |
| GPT-5 | OpenAI | 84.6% | | Unverified | Aug 7, 2025 |
| GPT-5 mini | OpenAI | 83.8% | | Unverified | Aug 7, 2025 |
| Grok 4.1 fast reasoning | xAI | 82.2% | | Unverified | Nov 19, 2025 |
| Grok 4 | xAI | 81.9% | | Unverified | Jul 9, 2025 |
| MiniMax M2.1 | MiniMax | 81.0% | | Unverified | Dec 23, 2025 |
| o3 | OpenAI | 80.8% | | Unverified | Apr 16, 2025 |
| Gemini 2.5 Pro | Google DeepMind | 80.1% | | Unverified | Mar 25, 2025 |
| Gemini 3 Flash | Google DeepMind | 79.7% | | Unverified | Dec 17, 2025 |
| Qwen3 Max | Alibaba / Qwen | 76.7% | | Unverified | Sep 5, 2025 |
| Claude Opus 4.5 | Anthropic | 73.8% | | Unverified | Nov 24, 2025 |
| DeepSeek R1 | DeepSeek | 61.7% | | Unverified | Jan 20, 2025 |
| DeepSeek V3.2 | DeepSeek | 59.3% | | Unverified | Dec 1, 2025 |
| Claude Sonnet 4.5 | Anthropic | 59.0% | | Unverified | Sep 29, 2025 |
| Qwen 3 Coder 480B | Alibaba / Qwen | 58.5% | | Unverified | Jul 22, 2025 |
| DeepSeek V3.1 | DeepSeek | 57.7% | | Unverified | Aug 21, 2025 |
| GLM-4.6 | Z.ai | 56.1% | | Unverified | Sep 30, 2025 |
| Claude Opus 4 | Anthropic | 54.2% | | Unverified | May 22, 2025 |
| Qwen3 235B A22B Instruct 2507 | Alibaba / Qwen | 52.4% | | Unverified | Jul 21, 2025 |
| Claude Haiku 4.5 | Anthropic | 51.1% | | Unverified | Oct 15, 2025 |
| Gemini 2.5 Flash | Google DeepMind | 49.5% | | Unverified | Apr 17, 2025 |
| GPT-4.1 | OpenAI | 45.7% | | Unverified | Apr 14, 2025 |
| Claude Sonnet 4 | Anthropic | 44.9% | | Unverified | May 22, 2025 |
| DeepSeek V3 0324 | DeepSeek | 40.5% | | Unverified | Mar 24, 2025 |
| Llama 4 Maverick | Meta | 39.7% | | Unverified | Apr 5, 2025 |
| Claude 3.7 Sonnet | Anthropic | 39.4% | | Unverified | Feb 24, 2025 |
| Claude 3.5 Sonnet | Anthropic | 38.1% | | Unverified | Jun 20, 2024 |
| Amazon Nova 2 Lite | Amazon | 34.6% | | Unverified | Dec 2, 2025 |
| Gemini 2.0 Flash | Google DeepMind | 33.4% | | Unverified | Dec 11, 2024 |
| Llama 3.1 405B | Meta | 30.5% | | Unverified | Jul 23, 2024 |
| Llama 4 Scout | Meta | 29.9% | | Unverified | Apr 5, 2025 |
| Jamba 1.7 Large | AI21 Labs | 18.1% | | Unverified | Jul 3, 2025 |
| Mistral Large | Mistral AI | 17.8% | | Unverified | Feb 26, 2024 |

Each row reports the model’s Pass@1 on LiveCodeBench. Click a row for the full run context.