Vibe Code Bench
An end-to-end web application development benchmark (by Vals AI / Replit) in which models build complete full-stack web apps from natural-language specifications in a sandboxed environment with production services (Supabase, Stripe, email). An autonomous browser agent then tests each app, and the reported score is the overall application pass accuracy.
Category: Coding. Metric: Overall accuracy (higher is better).
| Model | Lab | Score (overall accuracy) | Source model | Status | Date |
|---|---|---|---|---|---|
| Claude Opus 4.8 | Anthropic | 82.72% | — | Official | May 28, 2026 |
| Claude Opus 4.7 | Anthropic | 71.00% | — | Official | Apr 16, 2026 |
| GPT-5.5 | OpenAI | 69.85% | — | Official | Apr 23, 2026 |
| GPT-5.4 | OpenAI | 67.42% | — | Official | Mar 5, 2026 |
| GPT-5.3-Codex | OpenAI | 61.77% | — | Official | Feb 5, 2026 |
| Claude Opus 4.6 | Anthropic | 57.57% | — | Official | Feb 5, 2026 |
| GPT-5.2 | OpenAI | 53.50% | — | Verified | Dec 11, 2025 |
| Claude Sonnet 4.6 | Anthropic | 51.48% | — | Verified | Feb 17, 2026 |
| DeepSeek V4 Pro | DeepSeek | 49.93% | — | Verified | Apr 24, 2026 |
| Gemini 3.5 Flash | Google DeepMind | 48.68% | — | Verified | May 19, 2026 |
| MiniMax M3 | MiniMax | 47.57% | — | Official | Jun 1, 2026 |
| GPT-5.2-Codex | OpenAI | 37.91% | — | Verified | Dec 18, 2025 |
| Kimi K2.6 | Moonshot AI | 37.89% | — | Verified | Apr 20, 2026 |
| Gemini 3.1 Pro Preview | Google DeepMind | 32.03% | — | Verified | Feb 19, 2026 |
| GLM-5.1 | Z.ai | 31.46% | — | Verified | Apr 7, 2026 |
| Qwen 3.6 Plus | Alibaba / Qwen | 25.57% | — | Verified | Apr 2, 2026 |
| GPT-5.1 | OpenAI | 24.61% | — | Verified | Nov 12, 2025 |
| GLM-5 | Z.ai | 23.36% | — | Verified | Feb 11, 2026 |
| Claude Sonnet 4.5 | Anthropic | 22.62% | — | Verified | Sep 29, 2025 |
| Claude Opus 4.5 | Anthropic | 20.63% | — | Verified | Nov 24, 2025 |
| Gemini 3 Flash | Google DeepMind | 20.20% | — | Verified | Dec 17, 2025 |
| GPT-5 | OpenAI | 20.09% | — | Verified | Aug 7, 2025 |
| Muse Spark | Meta | 19.67% | — | Verified | Apr 8, 2026 |
| Grok 4.3 | xAI | 19.40% | — | Verified | Apr 17, 2026 |
| Kimi K2.5 | Moonshot AI | 17.54% | — | Verified | Jan 27, 2026 |
| MiniMax M2.5 | MiniMax | 14.85% | — | Verified | Feb 12, 2026 |
| Gemini 3 Pro | Google DeepMind | 14.30% | — | Verified | Nov 18, 2025 |
| GPT-5 mini | OpenAI | 14.17% | — | Verified | Aug 7, 2025 |
| MiniMax M2.7 | MiniMax | 11.93% | — | Verified | Mar 18, 2026 |
| Qwen3 Max | Alibaba / Qwen | 3.51% | — | Verified | Sep 5, 2025 |
| GLM-4.6 | Z.ai | 3.09% | — | Verified | Sep 30, 2025 |
| Grok 4.1 fast reasoning | xAI | 1.20% | — | Verified | Nov 19, 2025 |
| Gemini 2.5 Pro | Google DeepMind | 0.40% | — | Verified | Mar 25, 2025 |
Each row reports the model’s overall accuracy on Vibe Code Bench.