
FrontierCode


Category: Coding. Metric: weighted score (Diamond). Higher is better.

What this benchmark measures

Cognition's benchmark for code mergeability and production quality, not just correctness. Tasks are drawn from 36 real open-source repositories and authored by their maintainers (40+ hours each), with concise, humanlike prompts (~1/3 the length of SWE-bench Pro). Solutions are graded against a maintainer-style rubric spanning behavioral correctness, regression safety, mechanical cleanliness, test correctness, scope, and code quality; the reported score is a weighted aggregate of the rubric items, and any solution that fails a 'blocker' criterion scores 0. Three nested subsets are published — Diamond (50 hardest tasks), Main (100), and Extended (150) — with each model run 5× at every available reasoning effort and the best effort reported. Tasks are kept private to avoid contamination.
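The scoring rule described above (a weighted aggregate of rubric items, with any failed blocker forcing a score of 0) can be sketched roughly as follows. This is a minimal illustration, not Cognition's grader: the `RubricItem` type, its field names, the weights, and the normalization by total weight are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricItem:
    """One graded criterion. All fields are hypothetical, for illustration only."""
    name: str
    weight: float          # relative importance (assumed non-negative)
    passed: bool           # whether the solution met this criterion
    blocker: bool = False  # failing a blocker zeroes the whole task score


def task_score(items: list[RubricItem]) -> float:
    """Return a score in [0, 1].

    Any failed blocker yields 0; otherwise the score is the passed weight
    divided by the total weight (an assumed normalization).
    """
    if any(item.blocker and not item.passed for item in items):
        return 0.0
    total = sum(item.weight for item in items)
    if total == 0:
        return 0.0
    earned = sum(item.weight for item in items if item.passed)
    return earned / total


# Made-up rubric: two criteria pass, one non-blocker fails.
example = [
    RubricItem("behavioral correctness", weight=3, passed=True, blocker=True),
    RubricItem("regression safety", weight=2, passed=True),
    RubricItem("code quality", weight=1, passed=False),
]
print(task_score(example))  # 5 of 6 weight units earned
```

Under these assumptions, flipping the blocker item to `passed=False` drops the score to 0 no matter how many other criteria pass, which is what separates this metric from a plain pass rate.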

Rows on this page are sourced from public benchmark artifacts, leaderboard exports, or source-linked model reports. Each row keeps benchmark version, source model name, and available run details attached to the score.

The metric shown here is weighted score (Diamond). It should be interpreted within FrontierCode, not compared as part of a site-wide ranking.

What to be careful about

Scores shown are weighted-rubric scores on the Diamond subset (the 50 hardest tasks), which is FrontierCode's headline metric. Cognition also reports the Main (100 tasks) and Extended (150 tasks) subsets; where those were published, they are noted in the relevant row's run context.

No composite ranking
evals.report never combines benchmarks. The weighted score (Diamond) on FrontierCode is its own number; don't average it with other metrics.

Frequently asked

What is FrontierCode?

FrontierCode is Cognition's benchmark for code mergeability and production quality, not just correctness. Tasks come from 36 real open-source repositories, are authored by their maintainers, and are graded against a maintainer-style rubric. It is a coding benchmark measured by weighted score (Diamond).

What does weighted score (Diamond) mean on FrontierCode?

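FrontierCode reports weighted score (Diamond) as a percentage; higher is better. It is the weighted aggregate of rubric items on the Diamond subset (the 50 hardest tasks), and a solution that fails a blocker criterion scores 0. Scores are shown only within FrontierCode and are never averaged with other benchmarks.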

What is the top reported FrontierCode score?

Claude Opus 4.8 has the top reported score on FrontierCode: 13.4% (weighted score (Diamond)).

Why do FrontierCode scores differ across runs?

Harness, scaffold, reasoning effort, and prompt setup change results, so two runs of the same model can differ. evals.report keeps each score with its run context so the differences stay visible.
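Because FrontierCode runs each model 5× at every available reasoning effort and reports the best effort, the reported number depends on how those runs are grouped. The sketch below shows one plausible way to pick the best effort. It assumes the five runs at each effort are averaged before efforts are compared, which the benchmark description does not confirm; the function name, effort labels, and scores are all hypothetical.

```python
from statistics import mean


def best_effort(runs: dict[str, list[float]]) -> tuple[str, float]:
    """Return (effort, mean score) for the effort with the highest mean.

    `runs` maps a reasoning-effort label to the scores of its repeated runs.
    Averaging the repeats is an assumption, not a documented rule.
    """
    averages = {effort: mean(scores) for effort, scores in runs.items() if scores}
    if not averages:
        raise ValueError("no runs provided")
    effort = max(averages, key=averages.get)
    return effort, averages[effort]


# Made-up scores for illustration only: five runs per effort level.
sample_runs = {
    "low": [0.10, 0.12, 0.11, 0.09, 0.13],
    "high": [0.14, 0.12, 0.15, 0.13, 0.11],
}
effort, score = best_effort(sample_runs)
print(effort, round(score, 3))
```

Even in this toy data, individual runs at the same effort differ by several points, which is why a single run's score should be read alongside its run context.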

Does evals.report rank models across benchmarks?

No. FrontierCode scores are shown within their own metric; evals.report never combines benchmarks into a composite ranking or a single "best model".