Question 1

What is Vibe Code Bench?

Accepted Answer

Vibe Code Bench is an end-to-end web application development benchmark by Vals AI / Replit. Models build complete full-stack web apps from natural-language specifications in a sandboxed environment with production services (Supabase, Stripe, email), and an autonomous browser agent then scores them on overall application pass accuracy. It is a coding benchmark, and its reported metric is Overall accuracy.

Question 2

What does Overall accuracy mean on Vibe Code Bench?

Accepted Answer

Vibe Code Bench reports Overall accuracy (%); higher is better. Scores are shown only within Vibe Code Bench and are never averaged with other benchmarks.

Question 3

What is the top reported Vibe Code Bench score?

Accepted Answer

Claude Opus 4.8 has the top reported score on Vibe Code Bench: 82.72% (Overall accuracy).

Question 4

Why do Vibe Code Bench scores differ across runs?

Accepted Answer

Results depend on the harness, scaffold, reasoning effort, and prompt setup, so two runs of the same model can differ. evals.report keeps each score with its run context so the differences stay visible.
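
One way to picture "a score kept with its run context" is a record that stores the four factors named above next to the value. The sketch below is illustrative only: the class names, field names, and the helper runs_for are hypothetical, the scores are made-up placeholders, and none of it is taken from how evals.report actually stores data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunContext:
    # Hypothetical fields mirroring the factors listed in the answer above.
    harness: str
    scaffold: str
    reasoning_effort: str
    prompt_setup: str


@dataclass(frozen=True)
class ScoreRecord:
    benchmark: str
    model: str
    metric: str
    value: float  # percent; higher is better
    context: RunContext


def runs_for(records, benchmark, model):
    """Return every run of one model on one benchmark, without merging them."""
    return [r for r in records if r.benchmark == benchmark and r.model == model]


# Placeholder data: two runs of the same (made-up) model under different setups.
records = [
    ScoreRecord("Vibe Code Bench", "example-model", "Overall accuracy", 40.0,
                RunContext("harness-a", "scaffold-a", "high", "default")),
    ScoreRecord("Vibe Code Bench", "example-model", "Overall accuracy", 35.0,
                RunContext("harness-b", "scaffold-a", "low", "default")),
]

for run in runs_for(records, "Vibe Code Bench", "example-model"):
    print(f"{run.value:.2f}% ({run.context.harness}, effort={run.context.reasoning_effort})")
```

Because each run stays a separate record, the two results appear side by side with their setups rather than being collapsed into one number.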

Question 5

Does evals.report rank models across benchmarks?

Accepted Answer

No. Vibe Code Bench scores are shown within their own metric; evals.report never combines benchmarks into a composite ranking or a single "best model".
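
As a rough illustration of ranking within a benchmark only, the sketch below groups scores by benchmark and sorts each group on its own metric, with no step that averages across groups. The function name rank_within_benchmarks, the model names, the second benchmark name, and all numbers are hypothetical placeholders, not evals.report code or real results.

```python
from collections import defaultdict


def rank_within_benchmarks(scores):
    """Group (benchmark, model, value) tuples by benchmark and sort each group.

    Higher values rank first. No cross-benchmark composite is ever computed.
    """
    by_benchmark = defaultdict(list)
    for benchmark, model, value in scores:
        by_benchmark[benchmark].append((model, value))
    return {
        benchmark: sorted(rows, key=lambda row: row[1], reverse=True)
        for benchmark, rows in by_benchmark.items()
    }


# Placeholder scores for two made-up models on two benchmarks.
scores = [
    ("Vibe Code Bench", "model-a", 30.0),
    ("Vibe Code Bench", "model-b", 45.0),
    ("Other Benchmark", "model-a", 70.0),
    ("Other Benchmark", "model-b", 60.0),
]

rankings = rank_within_benchmarks(scores)
for benchmark, rows in rankings.items():
    print(benchmark, rows)
```

In this placeholder data a different model leads each benchmark, which is the situation a single composite ranking would hide.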
