CursorBench
Cursor's agentic-coding benchmark built from real, anonymized Cursor sessions: ambiguous, multi-file tasks spanning codebase understanding, bug finding, planning, code review, editing, refactoring, and bug fixes. Each model is evaluated across reasoning-effort levels; alongside the headline pass score, Cursor reports average cost per task (USD), tokens per task, and steps per task. Cursor cautions that small score differences may not be statistically meaningful.
What this benchmark measures
Rows on this page are sourced from public benchmark artifacts, leaderboard exports, or source-linked model reports. Each row keeps benchmark version, source model name, and available run details attached to the score.
The metric shown here is score (%); higher is better. Interpret it within CursorBench only: it is not part of a site-wide ranking.
What to be careful about
CursorBench is a vendor-run benchmark: Cursor runs it over a private task set drawn from real Cursor sessions, so results are not independently reproducible. CursorBench lists each model at several reasoning-effort levels; rows here take each model's best-scoring effort, which is noted in the run context. Figures are from CursorBench 3.1.
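The best-scoring-effort selection described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline actually used for this page: the `Run` record, the `best_effort_rows` helper, the model names, and all scores are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One hypothetical leaderboard entry: a model at one effort level."""
    model: str    # source model name (placeholder values below)
    effort: str   # reasoning-effort level, kept as run context
    score: float  # pass score in percent; higher is better


def best_effort_rows(runs):
    """Return one Run per model (its highest-scoring effort), best first."""
    best = {}
    for run in runs:
        current = best.get(run.model)
        # Keep the first run seen on ties; replace only on a strictly higher score.
        if current is None or run.score > current.score:
            best[run.model] = run
    return sorted(best.values(), key=lambda r: r.score, reverse=True)


# Placeholder data for illustration only; these are NOT CursorBench results.
runs = [
    Run("model-a", "low", 40.0),
    Run("model-a", "high", 55.0),
    Run("model-b", "medium", 60.0),
    Run("model-b", "high", 58.0),
]

for row in best_effort_rows(runs):
    print(f"{row.model}: {row.score:.1f}% (effort: {row.effort})")
```

Keeping the effort level on each selected row matters because a higher effort does not always score higher (as with the placeholder `model-b`), so the run context is needed to know which configuration produced the displayed score.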
Frequently asked
What is CursorBench?
CursorBench is Cursor's agentic-coding benchmark, built from real, anonymized Cursor sessions. Its tasks are ambiguous and multi-file, spanning codebase understanding, bug finding, planning, code review, editing, refactoring, and bug fixes. Each model is evaluated across reasoning-effort levels; alongside the headline pass score, Cursor reports average cost per task (USD), tokens per task, and steps per task. Cursor cautions that small score differences may not be statistically meaningful. It is an agents benchmark measured by score.
What does score mean on CursorBench?
CursorBench reports score (%); higher is better. Scores are shown only within CursorBench and are never averaged with other benchmarks.
What is the top reported CursorBench score?
Claude Fable 5 has the top reported score on CursorBench: 72.9%.
Why do CursorBench scores differ across runs?
Harness, scaffold, reasoning effort, and prompt setup change results, so two runs of the same model can differ. evals.report keeps each score with its run context so the differences stay visible.
Does evals.report rank models across benchmarks?
No. CursorBench scores are shown within their own metric; evals.report never combines benchmarks into a composite ranking or a single "best model".