evals.report

CursorBench


Agents | score | Higher is better

What is CursorBench?

Cursor's agentic-coding benchmark built from real, anonymized Cursor sessions: ambiguous, multi-file tasks spanning codebase understanding, bug finding, planning, code review, editing, refactoring, and bug fixes. Each model is evaluated across reasoning-effort levels; alongside the headline pass score, Cursor reports average cost per task (USD), tokens per task, and steps per task. Cursor cautions that small score differences may not be statistically meaningful. evals.report tracks reported CursorBench scores with the model, source, status, date, and run caveats attached, covering official leaderboard scores, vendor-reported launches, and clearly labeled community runs.

Top reported CursorBench score: Claude Fable 5, 72.9%.

Model              Lab              Score  Source model        Status    Date
Claude Fable 5     Anthropic        72.9%  Fable 5 Max         Official  Jun 9, 2026
Claude Opus 4.7    Anthropic        64.8%  Opus 4.7 Max        Official  Apr 16, 2026
GPT-5.5            OpenAI           64.3%  GPT-5.5 Extra High  Official  Apr 23, 2026
Claude Opus 4.8    Anthropic        63.8%  Opus 4.8 Max        Official  May 28, 2026
Gemini 3.5 Flash   Google DeepMind  49.8%  Gemini 3.5 Flash    Official  May 19, 2026
Claude Sonnet 4.6  Anthropic        49.0%  Sonnet 4.6 Max      Official  Feb 17, 2026
Kimi K2.6          Moonshot AI      47.6%  Kimi 2.6            Official  Apr 20, 2026
Kimi K2.5          Moonshot AI      31.9%  Kimi 2.5            Official  Jan 27, 2026

Each row reports the model’s score on CursorBench.
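Cursor's caution that small score differences may not be statistically meaningful can be illustrated with a rough calculation. The sketch below is not Cursor's methodology: it assumes each score is a plain pass rate over a fixed set of tasks, treats the two runs as independent, and uses a hypothetical task count (`N_TASKS`), since this page does not state how many tasks the benchmark contains. The function names are illustrative as well.

```python
import math


def pass_rate_std_error(pass_rate: float, n_tasks: int) -> float:
    """Standard error of a pass rate under a simple binomial model."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_tasks)


def difference_z_score(rate_a: float, rate_b: float, n_tasks: int) -> float:
    """z-score for the gap between two pass rates.

    Assumes both models ran on n_tasks tasks and that the runs are
    independent (a simplification: real runs share the same tasks).
    """
    se_a = pass_rate_std_error(rate_a, n_tasks)
    se_b = pass_rate_std_error(rate_b, n_tasks)
    return (rate_a - rate_b) / math.sqrt(se_a ** 2 + se_b ** 2)


# HYPOTHETICAL: the real number of CursorBench tasks is not given on this page.
N_TASKS = 500

# Scores taken from the table above.
z_close = difference_z_score(0.648, 0.643, N_TASKS)  # Claude Opus 4.7 vs GPT-5.5
z_wide = difference_z_score(0.729, 0.648, N_TASKS)   # Claude Fable 5 vs Claude Opus 4.7

print(f"Claude Opus 4.7 vs GPT-5.5:        z = {z_close:.2f}")
print(f"Claude Fable 5 vs Claude Opus 4.7: z = {z_wide:.2f}")
```

Under these assumptions, the half-point gap between Claude Opus 4.7 and GPT-5.5 falls well below the conventional |z| > 1.96 threshold, while the gap between Claude Fable 5 and Claude Opus 4.7 exceeds it. The conclusion depends on the real task count and evaluation design, so treat the numbers as illustrative only.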