CursorBench
What is CursorBench?
CursorBench is Cursor's agentic-coding benchmark, built from real, anonymized Cursor sessions. Its tasks are ambiguous and multi-file, spanning codebase understanding, bug finding, planning, code review, editing, refactoring, and bug fixes. Each model is evaluated across reasoning-effort levels; alongside the headline pass score, Cursor reports average cost per task (USD), tokens per task, and steps per task. Cursor cautions that small score differences may not be statistically meaningful. evals.report tracks reported CursorBench scores with the model, source, status, date, and run caveats attached, covering official leaderboard scores, vendor-reported launches, and clearly labeled community runs.
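To illustrate why small score gaps may not be meaningful, here is a rough sketch using a two-proportion z-test on pass rates listed on this page. The task count `N_TASKS` is purely hypothetical (this page does not state how many tasks CursorBench contains), and treating the two runs as independent samples is a simplifying assumption, since models are presumably scored on the same tasks. This is not Cursor's methodology, only a back-of-the-envelope check.

```python
import math


def two_proportion_z(p1: float, p2: float, n1: int, n2: int) -> float:
    """z statistic for the difference between two independent pass rates."""
    # Unpooled standard error of the difference p1 - p2.
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) / se


# HYPOTHETICAL task count: the real benchmark size is not given on this page.
N_TASKS = 500

# 64.8% vs 64.3%: a half-point gap between two adjacent entries.
z_close = two_proportion_z(0.648, 0.643, N_TASKS, N_TASKS)

# 72.9% vs 64.8%: the gap between the top two entries.
z_wide = two_proportion_z(0.729, 0.648, N_TASKS, N_TASKS)

# |z| below about 1.96 means the gap is not significant at the 5% level.
print(f"close gap z = {z_close:.2f}, significant: {abs(z_close) > 1.96}")
print(f"wide gap  z = {z_wide:.2f}, significant: {abs(z_wide) > 1.96}")
```

Under this assumed task count, the half-point gap falls well inside the noise while the eight-point gap does not. A different `N_TASKS` would shift both z values, so the conclusion depends on the real benchmark size.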
Top reported CursorBench score: Claude Fable 5, 72.9%.
| Model | Lab | Score | Source model | Status | Date |
|---|---|---|---|---|---|
| Claude Fable 5 | Anthropic | 72.9% | Fable 5 Max | Official | Jun 9, 2026 |
| Claude Opus 4.7 | Anthropic | 64.8% | Opus 4.7 Max | Official | Apr 16, 2026 |
| GPT-5.5 | OpenAI | 64.3% | GPT-5.5 Extra High | Official | Apr 23, 2026 |
| Claude Opus 4.8 | Anthropic | 63.8% | Opus 4.8 Max | Official | May 28, 2026 |
| Gemini 3.5 Flash | Google DeepMind | 49.8% | Gemini 3.5 Flash | Official | May 19, 2026 |
| Claude Sonnet 4.6 | Anthropic | 49.0% | Sonnet 4.6 Max | Official | Feb 17, 2026 |
| Kimi K2.6 | Moonshot AI | 47.6% | Kimi 2.6 | Official | Apr 20, 2026 |
| Kimi K2.5 | Moonshot AI | 31.9% | Kimi 2.5 | Official | Jan 27, 2026 |
Each row reports the model’s score on CursorBench.