evals.report

GDPval

GDPval evaluates AI models on real-world, economically valuable knowledge-work deliverables (documents, spreadsheets, slides, diagrams) spanning 44 occupations across 9 major U.S. GDP industries. Models work agentically, with shell and web access through a sandbox harness, and their outputs are scored by blind pairwise quality comparison. The Artificial Analysis GDPval-AA variant reports results as an Elo rating.

Agents · Elo · Higher is better
| Model | Lab | Score | Source model | Status | Date |
|---|---|---|---|---|---|
| Claude Opus 4.8 | Anthropic | 1890 | | Official | May 28, 2026 |
| GPT-5.5 | OpenAI | 1769 | | Official | Apr 23, 2026 |
| Claude Opus 4.7 | Anthropic | 1753 | | Official | Apr 16, 2026 |
| Claude Sonnet 4.6 | Anthropic | 1676 | | Official | Feb 17, 2026 |
| GPT-5.4 | OpenAI | 1674 | | Official | Mar 5, 2026 |
| Gemini 3.5 Flash | Google DeepMind | 1659 | | Official | May 19, 2026 |
| Claude Opus 4.6 | Anthropic | 1619 | | Official | Feb 5, 2026 |
| MiMo-V2.5-Pro | Xiaomi | 1571 | | Official | Apr 22, 2026 |
| DeepSeek V4 Pro | DeepSeek | 1558 | | Official | Apr 24, 2026 |
| MiMo-V2.5 | Xiaomi | 1551 | | Official | Apr 22, 2026 |
| GLM-5.1 | Z.ai | 1535 | | Official | Apr 7, 2026 |
| MiniMax M2.7 | MiniMax | 1505 | | Official | Mar 18, 2026 |
| Qwen 3.6 Max Preview | Alibaba / Qwen | 1504 | | Official | Apr 20, 2026 |
| Grok 4.3 | xAI | 1495 | | Official | Apr 17, 2026 |
| GPT-5.3-Codex | OpenAI | 1482 | | Official | Feb 5, 2026 |
| Kimi K2.6 | Moonshot AI | 1481 | | Official | Apr 20, 2026 |
| GPT-5.2 | OpenAI | 1467 | | Official | Dec 11, 2025 |
| Claude Opus 4.5 | Anthropic | 1452 | | Official | Nov 24, 2025 |
| Muse Spark | Meta | 1417 | | Official | Apr 8, 2026 |
| DeepSeek V4 Flash | DeepSeek | 1414 | | Official | Apr 24, 2026 |
| GLM-5 | Z.ai | 1395 | | Official | Feb 11, 2026 |
| Qwen 3.6 Plus | Alibaba / Qwen | 1354 | | Official | Apr 2, 2026 |
| Gemini 3 Deep Think | Google DeepMind | 1324 | | Official | Dec 4, 2025 |
| Claude Sonnet 4.5 | Anthropic | 1317 | | Official | Sep 29, 2025 |
| Gemini 3.1 Pro Preview | Google DeepMind | 1314 | | Official | Feb 19, 2026 |
| Step 3.7 Flash | StepFun | 1298 | | Official | May 29, 2026 |
| GPT-5 | OpenAI | 1294 | | Official | Aug 7, 2025 |
| GPT-5.2-Codex | OpenAI | 1288 | | Official | Dec 18, 2025 |
| Kimi K2.5 | Moonshot AI | 1285 | | Official | Jan 27, 2026 |
| GPT-5.1 | OpenAI | 1227 | | Official | Nov 12, 2025 |
| Qwen3.5-397B-A17B | Alibaba / Qwen | 1220 | | Official | Feb 16, 2026 |
| Gemini 3 Flash | Google DeepMind | 1204 | | Official | Dec 17, 2025 |
| DeepSeek V3.2 | DeepSeek | 1197 | | Official | Dec 1, 2025 |
| GLM-4.7 | Z.ai | 1185 | | Official | Dec 22, 2025 |
| GPT-5 mini | OpenAI | 1184 | | Official | Aug 7, 2025 |
| Gemini 3 Pro | Google DeepMind | 1184 | | Official | Nov 18, 2025 |
| MiniMax M2.5 | MiniMax | 1176 | | Official | Feb 12, 2026 |
| Claude Haiku 4.5 | Anthropic | 1171 | | Official | Oct 15, 2025 |
| Mistral Medium 3.5 | Mistral AI | 1168 | | Official | Apr 28, 2026 |
| Claude Sonnet 4 | Anthropic | 1133 | | Official | May 22, 2025 |
| MiniMax M2.1 | MiniMax | 1091 | | Official | Dec 23, 2025 |
| DeepSeek V3.1 | DeepSeek | 1080 | | Official | Aug 21, 2025 |
| Gemini 2.5 Flash | Google DeepMind | 1071 | | Official | Apr 17, 2025 |
| Claude 3.7 Sonnet | Anthropic | 1048 | | Official | Feb 24, 2025 |
| Grok 4.1 fast reasoning | xAI | 1046 | | Official | Nov 19, 2025 |
| Qwen3 Max | Alibaba / Qwen | 1038 | | Official | Sep 5, 2025 |
| GLM-4.6 | Z.ai | 1029 | | Official | Sep 30, 2025 |
| o4-mini | OpenAI | 1008 | | Official | Apr 16, 2025 |
| Kimi K2 Thinking | Moonshot AI | 992 | | Official | Nov 6, 2025 |
| Grok 4 | xAI | 989 | | Official | Jul 9, 2025 |
| GPT-OSS-120B | OpenAI | 947 | | Official | Aug 5, 2025 |
| Gemini 2.5 Pro | Google DeepMind | 919 | | Official | Mar 25, 2025 |
| Mistral Large | Mistral AI | 864 | | Official | Feb 26, 2024 |
| K-EXAONE | LG AI Research | 825 | | Official | Jan 12, 2026 |
| Qwen3 235B A22B Instruct 2507 | Alibaba / Qwen | 778 | | Official | Jul 21, 2025 |
| GPT-4.1 | OpenAI | 776 | | Official | Apr 14, 2025 |
| o3 | OpenAI | 753 | | Official | Apr 16, 2025 |
| Gemini 2.0 Flash | Google DeepMind | 566 | | Official | Dec 11, 2024 |
| Qwen 3 Coder 480B | Alibaba / Qwen | 506 | | Official | Jul 22, 2025 |
| Solar Pro 2 | Upstage | 449 | | Official | Jul 10, 2025 |
| Llama 4 Maverick | Meta | 435 | | Official | Apr 5, 2025 |
| DeepSeek V3 0324 | DeepSeek | 407 | | Official | Mar 24, 2025 |
| GPT-4o | OpenAI | 378 | | Official | May 13, 2024 |
| Jamba 1.7 Large | AI21 Labs | 282 | | Official | Jul 3, 2025 |
| Llama 4 Scout | Meta | 270 | | Official | Apr 5, 2025 |
| Llama 3.1 405B | Meta | 255 | | Official | Jul 23, 2024 |
| DeepSeek R1 | DeepSeek | 248 | | Official | Jan 20, 2025 |

Each row reports the model’s Elo on GDPval.
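
Because the scores come from pairwise comparisons, the gap between two Elo ratings can be read as an expected head-to-head preference rate. The sketch below assumes the conventional logistic Elo curve (base 10, 400-point scale). This page does not state which scale or fitting procedure GDPval-AA uses, so treat the result as a rough illustration rather than the benchmark's own method. The function name `expected_win_rate` is hypothetical; the two ratings are copied from the leaderboard above.

```python
def expected_win_rate(elo_a: float, elo_b: float, scale: float = 400.0) -> float:
    """Expected share of pairwise comparisons in which A is preferred over B.

    Assumes the conventional Elo logistic curve (base 10, 400-point scale);
    the actual GDPval-AA scale is not confirmed by the source page.
    """
    return 1.0 / (1.0 + 10.0 ** ((elo_b - elo_a) / scale))


# Ratings copied from the leaderboard rows.
claude_opus_4_8 = 1890
gpt_5_5 = 1769

p = expected_win_rate(claude_opus_4_8, gpt_5_5)
print(f"Expected preference rate for the higher-rated model: {p:.1%}")
```

Under that assumption, the 121-point gap between the top two rows corresponds to the higher-rated model being preferred in roughly two thirds of comparisons, and equal ratings correspond to an even split.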