FrontierCode
What is FrontierCode?
Cognition's benchmark for code mergeability and production quality, not just correctness. Tasks are drawn from 36 real open-source repositories and authored by their maintainers (40+ hours each), with concise, humanlike prompts (~1/3 the length of SWE-bench Pro). Solutions are graded against a maintainer-style rubric spanning behavioral correctness, regression safety, mechanical cleanliness, test correctness, scope, and code quality; the reported score is a weighted aggregate of the rubric items, and any solution that fails a 'blocker' criterion scores 0. Three nested subsets are published — Diamond (50 hardest tasks), Main (100), and Extended (150) — with each model run 5× at every available reasoning effort and the best effort reported. Tasks are kept private to avoid contamination. evals.report tracks reported FrontierCode scores with the model, source, status, date, and run caveats attached — official leaderboard scores, vendor-reported launches, and clearly labeled community runs.
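The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration, not Cognition's grader: the `RubricItem` structure, the weights, and the function names are hypothetical. It also rests on two assumptions the description leaves open: that the weighted aggregate is the passed weight divided by the total weight, and that "best effort reported" means averaging the runs at each reasoning effort and taking the highest average.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RubricItem:
    """One graded criterion (hypothetical structure, not the benchmark's schema)."""
    name: str
    weight: float
    passed: bool
    blocker: bool = False


def solution_score(items: list[RubricItem]) -> float:
    """Weighted fraction of rubric items passed; 0.0 if any blocker fails."""
    if any(item.blocker and not item.passed for item in items):
        return 0.0
    total = sum(item.weight for item in items)
    if total == 0:
        return 0.0
    earned = sum(item.weight for item in items if item.passed)
    # Assumption: the aggregate is normalized by total weight.
    return earned / total


def reported_score(runs_by_effort: dict[str, list[float]]) -> tuple[str, float]:
    """Average the runs at each effort level, then return the best effort."""
    # Assumption: runs are averaged per effort before picking the best one.
    averages = {effort: mean(scores) for effort, scores in runs_by_effort.items()}
    best = max(averages, key=averages.get)
    return best, averages[best]


# Illustrative rubrics with made-up weights.
mergeable = [
    RubricItem("behavioral correctness", 3.0, True, blocker=True),
    RubricItem("regression safety", 2.0, True),
    RubricItem("scope", 1.0, False),
]
blocked = [
    RubricItem("behavioral correctness", 3.0, False, blocker=True),
    RubricItem("regression safety", 2.0, True),
]

print(solution_score(mergeable))  # 5 of 6 weight units earned
print(solution_score(blocked))    # failed blocker, so the score is 0.0
print(reported_score({"low": [0.0, 0.2], "high": [0.4, 0.6]}))
```

The blocker check runs before any weighting, which is what makes a single failed blocker outweigh every other rubric item. That all-or-nothing behavior may explain why the reported scores are so low compared with pass-rate benchmarks.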
Top reported FrontierCode score: Claude Opus 4.8 at 13.4% (weighted score, Diamond subset).
| Model | Lab | Score (weighted, Diamond) | Source model | Status | Date |
|---|---|---|---|---|---|
| Claude Opus 4.8 | Anthropic | 13.4% | Claude Opus 4.8 | Official | May 28, 2026 |
| GPT-5.5 | OpenAI | 6.3% | GPT-5.5 | Official | Apr 23, 2026 |
| Claude Opus 4.7 | Anthropic | 5.2% | Claude Opus 4.7 | Official | Apr 16, 2026 |
| Gemini 3.1 Pro Preview | Google DeepMind | 4.7% | Gemini 3.1 Pro | Official | Feb 19, 2026 |
| GPT-5.4-mini | OpenAI | 4.6% | GPT-5.4-mini | Official | Mar 17, 2026 |
| Kimi K2.6 | Moonshot AI | 3.8% | Kimi K2.6 | Official | Apr 20, 2026 |
| Claude Sonnet 4.6 | Anthropic | 3.5% | Claude Sonnet 4.6 | Official | Feb 17, 2026 |
| SWE-1.6 | Cognition | 2.5% | SWE-1.6 | Official | Apr 7, 2026 |
| MiniMax M2.7 | MiniMax | 2.4% | MiniMax-M2.7 | Official | Mar 18, 2026 |
| MiniMax M2.5 | MiniMax | 1.1% | MiniMax-M2.5 | Official | Feb 12, 2026 |
| Kimi K2.5 | Moonshot AI | 1.0% | Kimi K2.5 | Official | Jan 27, 2026 |
| Gemini 3.1 Flash-Lite | Google DeepMind | 0.7% | Gemini 3.1 Flash-Lite | Official | Mar 3, 2026 |
Each row reports the model's weighted score on the Diamond subset of FrontierCode, sorted from highest to lowest.