evals.report

FrontierCode

Cognition's benchmark for code mergeability and production quality, not just correctness. Tasks are drawn from 36 real open-source repositories and authored by their maintainers (40+ hours each), with concise, humanlike prompts (roughly one third the length of SWE-bench Pro prompts). Solutions are graded against a maintainer-style rubric spanning behavioral correctness, regression safety, mechanical cleanliness, test correctness, scope, and code quality. The reported score is a weighted aggregate of the rubric items, and any solution that fails a 'blocker' criterion scores 0. Three nested subsets are published: Diamond (50 hardest tasks), Main (100), and Extended (150). Each model is run 5× at every available reasoning effort, and the best effort is reported. Tasks are kept private to avoid contamination.
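The scoring rule described above can be sketched roughly as follows. This is an illustration only, not the benchmark's actual grader: the `RubricItem` type, the pass/fail treatment of each item, the normalization by total weight, and the use of a mean across the repeated runs are all assumptions, since the description does not specify them.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RubricItem:
    """One rubric criterion (hypothetical shape, not the real schema)."""
    name: str
    weight: float
    passed: bool
    blocker: bool = False


def solution_score(items):
    """Weighted share of rubric items passed; 0 if any blocker fails."""
    # A failed blocker criterion zeroes the whole solution.
    if any(item.blocker and not item.passed for item in items):
        return 0.0
    total = sum(item.weight for item in items)
    if total == 0:
        return 0.0
    earned = sum(item.weight for item in items if item.passed)
    # Assumption: the aggregate is normalized to the range [0, 1].
    return earned / total


def best_effort(runs_by_effort):
    """Pick the reasoning effort with the highest average score.

    Assumption: repeated runs at one effort are combined with a mean.
    """
    averages = {effort: mean(scores) for effort, scores in runs_by_effort.items()}
    best = max(averages, key=averages.get)
    return best, averages[best]
```

Under these assumptions, a solution that passes items weighted 2 and 1 but misses a non-blocker item weighted 1 would score 3/4, while the same solution would score 0 if the missed item were a blocker.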

Coding · weighted score (Diamond) · Higher is better

No run guide for this benchmark yet.