Question 1

What is FrontierSWE?

Accepted Answer

FrontierSWE is Proximal Labs' ultra-long-horizon coding-agent benchmark. It consists of 17 open-ended technical projects spanning implementation, performance engineering, and applied ML research (e.g. optimizing a real compiler, inventing better ML optimizers, building a PostgreSQL-compatible server backed by SQLite). Agents get up to 20 hours per task and 5 trials each. Tasks are graded 0–1 on partial progress, and frontier models barely make headway, which makes FrontierSWE one of the few unsaturated public coding benchmarks. Models are ranked by 'dominance' (win rate against a random opponent across tasks), reported as a dominance score.

Question 2

What does dominance score mean on FrontierSWE?

Accepted Answer

Dominance is a model's win rate against a random opponent across tasks. FrontierSWE reports it as a percentage (dominance score); higher is better. Scores are shown only within FrontierSWE and are never averaged with other benchmarks.
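
To make the "win rate against a random opponent across tasks" idea concrete, here is a minimal sketch in Python. It is not FrontierSWE's published formula. It assumes each model has a single 0–1 score per task, that every other model is an equally likely opponent, and that ties count as half a win. The function name `dominance` and the model and task names are hypothetical, and the numbers are made up for illustration.

```python
def dominance(scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """Return a win-rate percentage per model.

    scores[model][task] is that model's 0-1 score on the task.
    Assumption (not from FrontierSWE): a tie counts as half a win,
    and only tasks scored for both models are compared.
    """
    result = {}
    for model, own in scores.items():
        wins = 0.0
        comparisons = 0
        for opponent, theirs in scores.items():
            if opponent == model:
                continue
            for task, a in own.items():
                if task not in theirs:
                    continue
                b = theirs[task]
                wins += 1.0 if a > b else 0.5 if a == b else 0.0
                comparisons += 1
        # No comparable opponent/task pairs means the rate is undefined.
        result[model] = 100.0 * wins / comparisons if comparisons else float("nan")
    return result


# Made-up scores for three hypothetical models on two hypothetical tasks.
example = {
    "model_a": {"task_1": 0.4, "task_2": 0.1},
    "model_b": {"task_1": 0.2, "task_2": 0.1},
    "model_c": {"task_1": 0.0, "task_2": 0.3},
}
print(dominance(example))
```

Under these assumptions a model can have low absolute task scores and still have a high dominance score, because the metric only measures how often it beats the other models.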

Question 3

What is the top reported FrontierSWE score?

Accepted Answer

Claude Fable 5 has the top reported score on FrontierSWE: 90% (dominance score).

Question 4

Why do FrontierSWE scores differ across runs?

Accepted Answer

Harness, scaffold, reasoning effort, and prompt setup change results, so two runs of the same model can differ. evals.report keeps each score with its run context so the differences stay visible.
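
One way to picture "keeping each score with its run context" is a record that stores the score alongside the four factors named above. The sketch below is only an illustration. The class names, field names, and values are hypothetical and do not describe evals.report's actual data model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RunContext:
    # Hypothetical fields mirroring the factors listed in the answer above.
    harness: str
    scaffold: str
    reasoning_effort: str
    prompt_setup: str


@dataclass(frozen=True)
class ReportedScore:
    model: str
    dominance_pct: float
    context: RunContext


def group_by_model(rows: list[ReportedScore]) -> dict[str, list[ReportedScore]]:
    """Group scores by model without merging runs that differ in context."""
    grouped: dict[str, list[ReportedScore]] = defaultdict(list)
    for row in rows:
        grouped[row.model].append(row)
    return dict(grouped)


# Placeholder values only; these are not real FrontierSWE results.
rows = [
    ReportedScore("example-model", 10.0, RunContext("harness-x", "scaffold-1", "high", "default")),
    ReportedScore("example-model", 20.0, RunContext("harness-y", "scaffold-1", "low", "default")),
]
for row in group_by_model(rows)["example-model"]:
    print(row.dominance_pct, row.context)
```

Because the two rows are kept separate rather than averaged, the reader can see which setup produced which number.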

Question 5

Does evals.report rank models across benchmarks?

Accepted Answer

No. FrontierSWE scores are shown within their own metric; evals.report never combines benchmarks into a composite ranking or a single "best model".