Question 1

What is FrontierCode?

Accepted Answer

FrontierCode is Cognition's benchmark for code mergeability and production quality, not just correctness. Tasks are drawn from 36 real open-source repositories and authored by their maintainers (40+ hours each), with concise, humanlike prompts (~1/3 the length of SWE-bench Pro). Solutions are graded against a maintainer-style rubric spanning behavioral correctness, regression safety, mechanical cleanliness, test correctness, scope, and code quality. The reported score is a weighted aggregate of the rubric items, and any solution that fails a 'blocker' criterion scores 0. Revision 1.1 (2026-07-07) publishes two nested subsets, Main (100 tasks) and Extended (150 tasks), having deprecated the original Diamond (50 hardest) subset. It also zeroes runs flagged for consulting solution-bearing sources such as the original pull request. Each model is run 5× at every available reasoning effort and its best effort is reported. Tasks are kept private to avoid contamination. The headline metric is the weighted score on the Main subset.
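
The scoring rule described above (a weighted aggregate with a blocker override, then best effort reported) can be sketched roughly as follows. This is an illustration only: the class and function names, the weights, and the choice to average the runs at each effort level are assumptions, not FrontierCode's published implementation.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RubricItem:
    name: str
    weight: float          # hypothetical weight; actual weights are not stated here
    passed: bool
    blocker: bool = False  # a failed blocker zeroes the whole solution


def task_score(items: list[RubricItem]) -> float:
    """Weighted fraction of rubric items passed, or 0.0 if any blocker fails."""
    if any(item.blocker and not item.passed for item in items):
        return 0.0
    total = sum(item.weight for item in items)
    if total == 0:
        return 0.0
    return sum(item.weight for item in items if item.passed) / total


def best_effort(runs_by_effort: dict[str, list[float]]) -> tuple[str, float]:
    """Pick the reasoning effort with the highest mean score.

    Assumption: repeated runs at one effort level are averaged.
    """
    means = {effort: mean(scores) for effort, scores in runs_by_effort.items()}
    effort = max(means, key=means.get)
    return effort, means[effort]
```

For example, a solution that passes a weight-3 item and fails a weight-1 item would score 0.75 under this sketch, while the same solution with any failed blocker would score 0.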

Question 2

What does weighted score (Main) mean on FrontierCode?

Accepted Answer

FrontierCode reports a weighted score on the Main subset, expressed as a percentage; higher is better. Scores are shown only within FrontierCode and are never averaged with other benchmarks.

Question 3

What is the top reported FrontierCode score?

Accepted Answer

Claude Fable 5 has the top reported score on FrontierCode: 53.5% (weighted score (Main)).

Question 4

Why do FrontierCode scores differ across runs?

Accepted Answer

Harness, scaffold, reasoning effort, and prompt setup change results, so two runs of the same model can differ. evals.report keeps each score with its run context so the differences stay visible.
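
One way to picture a score stored alongside its run context is the minimal sketch below. The record layout, field names, and the `comparable` helper are hypothetical and chosen for illustration; they do not describe how evals.report actually stores its data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunContext:
    # The four factors named above; field names are illustrative.
    harness: str
    scaffold: str
    reasoning_effort: str
    prompt_setup: str


@dataclass(frozen=True)
class ScoredRun:
    model: str
    score: float  # weighted score (Main), in percent
    context: RunContext


def comparable(a: ScoredRun, b: ScoredRun) -> bool:
    """Treat two scores as directly comparable only if their run contexts match."""
    return a.context == b.context
```

Keeping the context on the record means a gap between two scores for the same model can be traced to a differing field instead of being averaged away.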

Question 5

Does evals.report rank models across benchmarks?

Accepted Answer

No. FrontierCode scores are shown within their own metric; evals.report never combines benchmarks into a composite ranking or a single "best model".