evals.report

Vibe Code Bench

An end-to-end web application development benchmark (by Vals AI / Replit). Models build complete full-stack web apps from natural-language specifications in a sandboxed environment with production services (Supabase, Stripe, email). An autonomous browser agent then scores them on overall application pass accuracy.

Coding · Overall accuracy · Higher is better

| Model | Lab | Score | Source model | Status | Date |
|---|---|---|---|---|---|
| Claude Opus 4.8 | Anthropic | 82.72% | | Official | May 28, 2026 |
| Claude Opus 4.7 | Anthropic | 71.00% | | Official | Apr 16, 2026 |
| GPT-5.5 | OpenAI | 69.85% | | Official | Apr 23, 2026 |
| GPT-5.4 | OpenAI | 67.42% | | Official | Mar 5, 2026 |
| GPT-5.3-Codex | OpenAI | 61.77% | | Official | Feb 5, 2026 |
| Claude Opus 4.6 | Anthropic | 57.57% | | Official | Feb 5, 2026 |
| GPT-5.2 | OpenAI | 53.50% | | Verified | Dec 11, 2025 |
| Claude Sonnet 4.6 | Anthropic | 51.48% | | Verified | Feb 17, 2026 |
| DeepSeek V4 Pro | DeepSeek | 49.93% | | Verified | Apr 24, 2026 |
| Gemini 3.5 Flash | Google DeepMind | 48.68% | | Verified | May 19, 2026 |
| MiniMax M3 | MiniMax | 47.57% | | Official | Jun 1, 2026 |
| GPT-5.2-Codex | OpenAI | 37.91% | | Verified | Dec 18, 2025 |
| Kimi K2.6 | Moonshot AI | 37.89% | | Verified | Apr 20, 2026 |
| Gemini 3.1 Pro Preview | Google DeepMind | 32.03% | | Verified | Feb 19, 2026 |
| GLM-5.1 | Z.ai | 31.46% | | Verified | Apr 7, 2026 |
| Qwen 3.6 Plus | Alibaba / Qwen | 25.57% | | Verified | Apr 2, 2026 |
| GPT-5.1 | OpenAI | 24.61% | | Verified | Nov 12, 2025 |
| GLM-5 | Z.ai | 23.36% | | Verified | Feb 11, 2026 |
| Claude Sonnet 4.5 | Anthropic | 22.62% | | Verified | Sep 29, 2025 |
| Claude Opus 4.5 | Anthropic | 20.63% | | Verified | Nov 24, 2025 |
| Gemini 3 Flash | Google DeepMind | 20.20% | | Verified | Dec 17, 2025 |
| GPT-5 | OpenAI | 20.09% | | Verified | Aug 7, 2025 |
| Muse Spark | Meta | 19.67% | | Verified | Apr 8, 2026 |
| Grok 4.3 | xAI | 19.40% | | Verified | Apr 17, 2026 |
| Kimi K2.5 | Moonshot AI | 17.54% | | Verified | Jan 27, 2026 |
| MiniMax M2.5 | MiniMax | 14.85% | | Verified | Feb 12, 2026 |
| Gemini 3 Pro | Google DeepMind | 14.30% | | Verified | Nov 18, 2025 |
| GPT-5 mini | OpenAI | 14.17% | | Verified | Aug 7, 2025 |
| MiniMax M2.7 | MiniMax | 11.93% | | Verified | Mar 18, 2026 |
| Qwen3 Max | Alibaba / Qwen | 3.51% | | Verified | Sep 5, 2025 |
| GLM-4.6 | Z.ai | 3.09% | | Verified | Sep 30, 2025 |
| Grok 4.1 fast reasoning | xAI | 1.20% | | Verified | Nov 19, 2025 |
| Gemini 2.5 Pro | Google DeepMind | 0.40% | | Verified | Mar 25, 2025 |

Each row reports the model’s Overall accuracy on Vibe Code Bench.