evals.report

Design Arena

A crowdsourced human-preference benchmark in which top AI models receive identical design and frontend prompts, and users vote head-to-head on the anonymized outputs. The votes are fit to a Bradley-Terry model and reported as Elo-style ratings, ranking design taste across categories such as websites, UI components, games, and data visualization.
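
To make the rating step concrete, here is a minimal sketch of how pairwise votes could be turned into Bradley-Terry strengths and then mapped onto an Elo-like scale. Everything in it is an assumption for illustration: the vote counts and model names are made up, the helper names `fit_bradley_terry` and `to_elo` are hypothetical, and the base of 1000 with a 400-point scale is the conventional Elo mapping rather than anything Design Arena documents here.

```python
import math

def fit_bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths with the classic MM update.

    wins[(a, b)] is the number of votes where a beat b.
    Assumes every model has at least one win (otherwise its strength is 0).
    """
    models = sorted({m for pair in wins for m in pair})
    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            total_wins = sum(w for (a, _), w in wins.items() if a == i)
            denom = 0.0
            for j in models:
                if j == i:
                    continue
                games = wins.get((i, j), 0) + wins.get((j, i), 0)
                if games:
                    denom += games / (strength[i] + strength[j])
            updated[i] = total_wins / denom
        # Strengths are only defined up to a constant factor,
        # so pin the geometric mean to 1 after each pass.
        log_mean = sum(math.log(s) for s in updated.values()) / len(updated)
        strength = {m: s / math.exp(log_mean) for m, s in updated.items()}
    return strength

def to_elo(strength, base=1000.0, scale=400.0):
    """Map strengths to an Elo-like scale (base and scale are assumed)."""
    return {m: base + scale * math.log10(s) for m, s in strength.items()}

# Hypothetical head-to-head vote counts, not real Design Arena data.
votes = {
    ("model_a", "model_b"): 7, ("model_b", "model_a"): 3,
    ("model_a", "model_c"): 8, ("model_c", "model_a"): 2,
    ("model_b", "model_c"): 6, ("model_c", "model_b"): 4,
}
ratings = to_elo(fit_bradley_terry(votes))
for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.0f}")
```

Production leaderboards typically add confidence intervals and handle ties or sparse matchups; this sketch leaves those out.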

Chat preference · Elo · Higher is better
| Model | Lab | Score | Source model | Status | Date |
| --- | --- | --- | --- | --- | --- |
| Claude Opus 4.6 | Anthropic | 1344 | | Verified | Feb 5, 2026 |
| GLM-5.1 | Z.ai | 1335 | | Verified | Apr 7, 2026 |
| Kimi K2.6 | Moonshot AI | 1335 | | Verified | Apr 20, 2026 |
| Claude Opus 4.7 | Anthropic | 1328 | | Verified | Apr 16, 2026 |
| Claude Sonnet 4.6 | Anthropic | 1327 | | Verified | Feb 17, 2026 |
| MiMo-V2.5-Pro | Xiaomi | 1325 | | Verified | Apr 22, 2026 |
| MiniMax M3 | MiniMax | 1321 | | Verified | Jun 1, 2026 |
| MiMo-V2.5 | Xiaomi | 1309 | | Verified | Apr 22, 2026 |
| Muse Spark | Meta | 1306 | | Verified | Apr 8, 2026 |
| DeepSeek V4 Pro | DeepSeek | 1302 | | Verified | Apr 24, 2026 |
| GPT-5.5 | OpenAI | 1301 | | Verified | Apr 23, 2026 |
| GLM-5 | Z.ai | 1300 | | Verified | Feb 11, 2026 |
| Gemini 3.5 Flash | Google DeepMind | 1297 | | Verified | May 19, 2026 |
| Claude Opus 4.5 | Anthropic | 1295 | | Verified | Nov 24, 2025 |
| Gemini 3 Pro | Google DeepMind | 1295 | | Verified | Nov 18, 2025 |
| Kimi K2.5 | Moonshot AI | 1292 | | Verified | Jan 27, 2026 |
| MiniMax M2.7 | MiniMax | 1285 | | Verified | Mar 18, 2026 |
| Claude Opus 4.8 | Anthropic | 1282 | | Verified | May 28, 2026 |
| Gemini 3.1 Pro Preview | Google DeepMind | 1282 | | Verified | Feb 19, 2026 |
| Qwen 3.6 Plus | Alibaba / Qwen | 1281 | | Verified | Apr 2, 2026 |
| GLM-4.7 | Z.ai | 1273 | | Verified | Dec 22, 2025 |
| Grok 4.20 beta reasoning | xAI | 1271 | | Verified | Mar 9, 2026 |
| DeepSeek V4 Flash | DeepSeek | 1268 | | Verified | Apr 24, 2026 |
| GPT-5.4 | OpenAI | 1264 | | Verified | Mar 5, 2026 |
| MiniMax M2.5 | MiniMax | 1261 | | Verified | Feb 12, 2026 |
| Grok 4.3 | xAI | 1260 | | Verified | Apr 17, 2026 |
| MiniMax M2.1 | MiniMax | 1245 | | Verified | Dec 23, 2025 |
| Gemini 3 Flash | Google DeepMind | 1244 | | Verified | Dec 17, 2025 |
| Claude Sonnet 4.5 | Anthropic | 1235 | | Verified | Sep 29, 2025 |
| Qwen3.5-397B-A17B | Alibaba / Qwen | 1233 | | Verified | Feb 16, 2026 |
| Claude 3.7 Sonnet | Anthropic | 1231 | | Verified | Feb 24, 2025 |
| GPT-5.2 | OpenAI | 1224 | | Verified | Dec 11, 2025 |
| GPT-5 | OpenAI | 1223 | | Verified | Aug 7, 2025 |
| DeepSeek V3.2 | DeepSeek | 1220 | | Verified | Dec 1, 2025 |
| GLM-4.6 | Z.ai | 1220 | | Verified | Sep 30, 2025 |
| Claude Opus 4.1 | Anthropic | 1219 | | Verified | Aug 5, 2025 |
| GPT-5.1 | OpenAI | 1216 | | Verified | Nov 12, 2025 |
| Claude Opus 4 | Anthropic | 1215 | | Verified | May 22, 2025 |
| Gemini 2.5 Pro | Google DeepMind | 1208 | | Verified | Mar 25, 2025 |
| GPT-5.3-Codex | OpenAI | 1199 | | Verified | Feb 5, 2026 |
| Qwen 3 Coder 480B | Alibaba / Qwen | 1197 | | Verified | Jul 22, 2025 |
| Claude Sonnet 4 | Anthropic | 1196 | | Verified | May 22, 2025 |
| DeepSeek R1 | DeepSeek | 1193 | | Verified | Jan 20, 2025 |
| Mistral Medium 3.5 | Mistral AI | 1176 | | Verified | Apr 28, 2026 |
| GPT-5 mini | OpenAI | 1170 | | Verified | Aug 7, 2025 |
| Claude Haiku 4.5 | Anthropic | 1169 | | Verified | Oct 15, 2025 |
| DeepSeek V3.1 | DeepSeek | 1166 | | Verified | Aug 21, 2025 |
| Qwen3 Max | Alibaba / Qwen | 1165 | | Verified | Sep 5, 2025 |
| DeepSeek V3 0324 | DeepSeek | 1163 | | Verified | Mar 24, 2025 |
| Grok 4.1 fast reasoning | xAI | 1142 | | Verified | Nov 19, 2025 |
| Gemini 2.5 Flash | Google DeepMind | 1113 | | Verified | Apr 17, 2025 |
| Qwen3 235B A22B Instruct 2507 | Alibaba / Qwen | 1093 | | Verified | Jul 21, 2025 |
| Kimi K2 Instruct | Moonshot AI | 1088 | | Verified | Jul 11, 2025 |
| GPT-4.1 | OpenAI | 1080 | | Verified | Apr 14, 2025 |
| o3 | OpenAI | 1074 | | Verified | Apr 16, 2025 |
| Grok 4 | xAI | 1070 | | Verified | Jul 9, 2025 |
| o4-mini | OpenAI | 1030 | | Verified | Apr 16, 2025 |
| OLMo 3.1-Think 32B | Allen Institute for AI | 1029 | | Verified | Dec 12, 2025 |
| GPT-OSS-120B | OpenAI | 1017 | | Verified | Aug 5, 2025 |
| Llama 4 Maverick | Meta | 934 | | Verified | Apr 5, 2025 |
| GPT-4o | OpenAI | 915 | | Verified | May 13, 2024 |
| Llama 4 Scout | Meta | 844 | | Verified | Apr 5, 2025 |

Each row reports the model’s Elo on Design Arena.
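
As a rough aid for reading the score gaps, the sketch below converts a rating difference into an expected head-to-head win rate. It assumes the conventional Elo logistic curve with a 400-point scale; the page does not state which scale Design Arena uses, so treat the numbers as indicative only. The function name `expected_win_rate` is hypothetical.

```python
def expected_win_rate(rating_a, rating_b, scale=400.0):
    """Probability that A is preferred over B under the standard Elo curve.

    The 400-point scale is an assumption, not something the leaderboard confirms.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))

# Scores taken from the table: Claude Opus 4.6 (1344) vs GPT-5.5 (1301).
p = expected_win_rate(1344, 1301)
print(f"Expected preference rate: {p:.1%}")
```

Under that assumption, the 43-point gap between those two rows corresponds to the higher-rated model being preferred in roughly 56% of matchups, which shows how small many of the gaps near the top of the table are.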