evals.report

FrontierCode

Cognition's benchmark for code mergeability and production quality, not just correctness. Tasks are drawn from 36 real open-source repositories and authored by their maintainers (40+ hours each), with concise, humanlike prompts (~1/3 the length of SWE-bench Pro). Solutions are graded against a maintainer-style rubric spanning behavioral correctness, regression safety, mechanical cleanliness, test correctness, scope, and code quality. The reported score is a weighted aggregate of the rubric items, and any solution that fails a 'blocker' criterion scores 0. Three nested subsets are published: Diamond (50 hardest tasks), Main (100), and Extended (150). Each model is run 5× at every available reasoning effort, and the best effort is reported. Tasks are kept private to avoid contamination.

Coding · weighted score (Diamond) · Higher is better

What is FrontierCode?

evals.report tracks reported FrontierCode scores with the model, source, status, date, and run caveats attached: official leaderboard scores, vendor-reported launches, and clearly labeled community runs.
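The scoring rule in the benchmark description (a weighted aggregate of rubric items, zeroed by any failed blocker, with the best reasoning effort reported) can be sketched in a few lines of Python. This is a minimal illustration, not Cognition's grader: the `RubricItem` type, the weights, and the function names are hypothetical, and two details are assumptions the description does not state, namely that the aggregate is normalized by total weight and that the 5 runs at each effort are averaged before the best effort is picked.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RubricItem:
    """One rubric criterion (hypothetical structure for illustration)."""
    name: str
    weight: float          # assumed weight; real rubric weights are not published here
    passed: bool
    blocker: bool = False  # a failed blocker zeroes the whole solution


def solution_score(items: list[RubricItem]) -> float:
    """Weighted fraction of rubric items passed, or 0 if any blocker fails."""
    if any(item.blocker and not item.passed for item in items):
        return 0.0
    total = sum(item.weight for item in items)
    if total == 0:
        return 0.0
    # Assumption: the aggregate is normalized by the total weight.
    return sum(item.weight for item in items if item.passed) / total


def reported_score(runs_by_effort: dict[str, list[float]]) -> tuple[str, float]:
    """Pick the reasoning effort with the highest average score.

    Assumption: the repeated runs at each effort are combined by a plain mean.
    """
    averages = {effort: mean(scores) for effort, scores in runs_by_effort.items()}
    best = max(averages, key=averages.get)
    return best, averages[best]


# Example with made-up rubric outcomes and made-up per-run scores.
example_items = [
    RubricItem("behavioral correctness", 3.0, passed=True, blocker=True),
    RubricItem("regression safety", 2.0, passed=True),
    RubricItem("code quality", 1.0, passed=False),
]
print(solution_score(example_items))
print(reported_score({"low": [0.1, 0.2], "high": [0.3, 0.5]}))
```

The blocker check runs before any weighting, which is one way to read why Diamond scores in the table stay low: a solution that passes most rubric items still contributes 0 if it fails a single blocker.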

Top reported FrontierCode score: Claude Opus 4.8, with a weighted score (Diamond) of 13.4%.

| Model | Lab | Score | Source model | Status | Date |
|---|---|---|---|---|---|
| Claude Opus 4.8 | Anthropic | 13.4% | Claude Opus 4.8 | Official | May 28, 2026 |
| GPT-5.5 | OpenAI | 6.3% | GPT-5.5 | Official | Apr 23, 2026 |
| Claude Opus 4.7 | Anthropic | 5.2% | Claude Opus 4.7 | Official | Apr 16, 2026 |
| Gemini 3.1 Pro Preview | Google DeepMind | 4.7% | Gemini 3.1 Pro | Official | Feb 19, 2026 |
| GPT-5.4-mini | OpenAI | 4.6% | GPT-5.4-mini | Official | Mar 17, 2026 |
| Kimi K2.6 | Moonshot AI | 3.8% | Kimi K2.6 | Official | Apr 20, 2026 |
| Claude Sonnet 4.6 | Anthropic | 3.5% | Claude Sonnet 4.6 | Official | Feb 17, 2026 |
| SWE-1.6 | Cognition | 2.5% | SWE-1.6 | Official | Apr 7, 2026 |
| MiniMax M2.7 | MiniMax | 2.4% | MiniMax-M2.7 | Official | Mar 18, 2026 |
| MiniMax M2.5 | MiniMax | 1.1% | MiniMax-M2.5 | Official | Feb 12, 2026 |
| Kimi K2.5 | Moonshot AI | 1.0% | Kimi K2.5 | Official | Jan 27, 2026 |
| Gemini 3.1 Flash-Lite | Google DeepMind | 0.7% | Gemini 3.1 Flash-Lite | Official | Mar 3, 2026 |

Each row reports the model’s weighted score (Diamond) on FrontierCode.