Question 1

What is FrontierCode?

Accepted Answer

Cognition's benchmark for code mergeability and production quality, not just correctness. Tasks are drawn from 36 real open-source repositories and authored by their maintainers (40+ hours each), with concise, human-like prompts (~1/3 the length of SWE-bench Pro). Solutions are graded against a maintainer-style rubric spanning behavioral correctness, regression safety, mechanical cleanliness, test correctness, scope, and code quality. The reported score is a weighted aggregate of the rubric items, and any solution that fails a 'blocker' criterion scores 0. Revision 1.1 (2026-07-07) publishes two nested subsets, Main (100 tasks) and Extended (150 tasks), after deprecating the original Diamond subset (the 50 hardest tasks). It also zeroes runs flagged for consulting solution-bearing sources such as the original pull request. Each model is run 5× at every available reasoning effort, and its best-scoring effort is reported. Tasks are kept private to avoid contamination. It is a coding benchmark whose headline metric is weighted score (Main).
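The scoring rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not FrontierCode's actual grader: the `RubricItem` fields, the example weights, the pass/fail treatment of each rubric item, and the use of a mean to combine the 5 runs per reasoning effort are all hypothetical, since the answer above does not specify them.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RubricItem:
    name: str              # hypothetical criterion label
    weight: float          # hypothetical weight; real weights are not given above
    passed: bool           # assumption: each item is graded pass/fail
    blocker: bool = False  # failing a blocker zeroes the whole solution


def task_score(items: list[RubricItem], flagged: bool = False) -> float:
    """Return one solution's score in [0, 1]."""
    # Runs flagged for consulting solution-bearing sources are zeroed.
    if flagged:
        return 0.0
    # Any failed blocker criterion scores 0 regardless of other items.
    if any(item.blocker and not item.passed for item in items):
        return 0.0
    total = sum(item.weight for item in items)
    if total == 0:
        return 0.0
    earned = sum(item.weight for item in items if item.passed)
    return earned / total


def reported_score(runs_by_effort: dict[str, list[float]]) -> tuple[str, float]:
    """Pick the best reasoning effort.

    Assumption: the runs at each effort are combined with a plain mean.
    """
    best = max(runs_by_effort, key=lambda effort: mean(runs_by_effort[effort]))
    return best, mean(runs_by_effort[best])


# Hypothetical rubric for a single solution.
example_items = [
    RubricItem("behavioral correctness", 3.0, passed=True, blocker=True),
    RubricItem("regression safety", 2.0, passed=True),
    RubricItem("code quality", 1.0, passed=False),
]
example_score = task_score(example_items)
```

With these made-up weights, the example solution earns 5 of 6 weighted points. Flipping the blocker item to failed, or passing `flagged=True`, drops the score to 0, which is the behavior that separates this style of grading from a plain pass rate.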

Question 2

What does weighted score (Main) mean on FrontierCode?

Accepted Answer

FrontierCode reports weighted score (Main) as a percentage; higher is better. Scores are shown only within FrontierCode and are never averaged with other benchmarks.

Question 3

What is the top reported FrontierCode score?

Accepted Answer

Claude Fable 5 has the top reported score on FrontierCode: 53.5% (weighted score (Main)).

Question 4

Why do FrontierCode scores differ across runs?

Accepted Answer

Harness, scaffold, reasoning effort, and prompt setup change results, so two runs of the same model can differ. evals.report keeps each score with its run context so the differences stay visible.

Question 5

Does evals.report rank models across benchmarks?

Accepted Answer

No. FrontierCode scores are shown within their own metric; evals.report never combines benchmarks into a composite ranking or a single "best model".

Model	Lab	Score	Source model	Status	Date
Claude Fable 5	Anthropic	53.5%	Fable 5	Official	Jun 9, 2026
Claude Opus 5	Anthropic	53.4%	Opus 5	Official	Jul 24, 2026
GPT-5.6 Sol	OpenAI	47.5%	GPT-5.6 Sol	Official	Jul 9, 2026
Claude Opus 4.8	Anthropic	46.5%	Opus 4.8	Official	May 28, 2026
GPT-5.5	OpenAI	43.0%	GPT-5.5	Official	Apr 23, 2026
Claude Sonnet 5	Anthropic	42.7%	Sonnet 5	Official	Jun 30, 2026
Grok 4.5	xAI	42.4%	Grok 4.5	Official	Jul 8, 2026
SWE-1.7	Cognition	42.3%	SWE-1.7	Official	Jul 8, 2026
GPT-5.6 Terra	OpenAI	41.3%	GPT-5.6 Terra	Official	Jul 9, 2026
GPT-5.6 Luna	OpenAI	39.8%	GPT-5.6 Luna	Official	Jul 9, 2026
Claude Opus 4.7	Anthropic	38.5%	Opus 4.7	Official	Apr 16, 2026
GLM-5.2 (Open)	Z.ai	24.5%	GLM 5.2	Official	Jun 16, 2026
DeepSeek V4 Pro (Open)	DeepSeek	17.6%	DeepSeek V4 Pro	Official	Apr 24, 2026
MiniMax M3 (Open)	MiniMax	14.7%	MiniMax M3	Official	Jun 1, 2026
Inkling (Open)	Thinking Machines Lab	14.0%	Inkling	Official	Jul 15, 2026
SWE-1.6	Cognition	9.4%	SWE-1.6	Official	Apr 7, 2026