FrontierSWE
Proximal Labs' ultra-long-horizon coding-agent benchmark: 17 open-ended technical projects spanning implementation, performance engineering, and applied ML research (e.g. optimizing a real compiler, inventing better ML optimizers, building a PostgreSQL-compatible server backed by SQLite). Agents get up to 20 hours per task and 5 trials each; tasks are graded 0–1 on partial progress, and frontier models barely make headway — making FrontierSWE one of the few unsaturated public coding benchmarks. Models are ranked by 'dominance' (win rate against a random opponent across tasks).
What this benchmark measures
Rows on this page are sourced from public benchmark artifacts, leaderboard exports, or source-linked model reports. Each row keeps benchmark version, source model name, and available run details attached to the score.
The metric shown here is dominance score. It should be interpreted within FrontierSWE, not compared as part of a site-wide ranking.
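As a rough illustration of how a "win rate against a random opponent across tasks" could be computed, here is a minimal Python sketch. It is not Proximal's published formula. The choices to compare every pair of trials, count ties as half a win, and weight opponents and tasks equally are all assumptions, and the model names, task names, and scores are hypothetical.

```python
from itertools import product
from statistics import mean

# Hypothetical per-task trial scores (each in 0..1); not real leaderboard data.
scores = {
    "model_a": {"compiler": [0.4, 0.2], "optimizer": [0.1]},
    "model_b": {"compiler": [0.1], "optimizer": [0.1]},
    "model_c": {"compiler": [0.0], "optimizer": [0.0]},
}


def pairwise_win_rate(mine: list[float], theirs: list[float]) -> float:
    """Share of trial pairs that `mine` wins; a tie counts as half a win (assumption)."""
    outcomes = [
        1.0 if a > b else 0.5 if a == b else 0.0
        for a, b in product(mine, theirs)
    ]
    return mean(outcomes)


def dominance(all_scores: dict[str, dict[str, list[float]]], model: str) -> float:
    """Assumed dominance (%): win rate vs. each opponent, averaged per task, then over tasks."""
    per_task = []
    for task, mine in all_scores[model].items():
        rates = [
            pairwise_win_rate(mine, opp[task])
            for name, opp in all_scores.items()
            if name != model and task in opp
        ]
        if rates:  # skip tasks where no opponent has a score
            per_task.append(mean(rates))
    return 100.0 * mean(per_task)


print(dominance(scores, "model_a"))  # 87.5
print(dominance(scores, "model_b"))  # 62.5
print(dominance(scores, "model_c"))  # 0.0
```

Under these assumptions the score is purely relative: model_a reaches 87.5% even though its best raw task score is only 0.4, which is why a high dominance score and low absolute progress can coexist.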
What to be careful about
Scores are from Proximal's official FrontierSWE leaderboard. Each model runs with its native agent harness (Claude Code, Codex, Gemini CLI, Kimi CLI, Qwen Code), noted per row; dominance reflects that model+harness combination.
Frequently asked
What is FrontierSWE?
FrontierSWE is Proximal Labs' ultra-long-horizon benchmark for coding agents. It consists of 17 open-ended technical projects spanning implementation, performance engineering, and applied ML research, such as optimizing a real compiler, inventing better ML optimizers, and building a PostgreSQL-compatible server backed by SQLite. Agents get up to 20 hours per task and 5 trials each, and each task is graded from 0 to 1 on partial progress. Frontier models barely make headway, which makes FrontierSWE one of the few unsaturated public coding benchmarks. Models are ranked by dominance score: the win rate against a random opponent across tasks.
What does dominance score mean on FrontierSWE?
FrontierSWE reports dominance score (%); higher is better. Scores are shown only within FrontierSWE and are never averaged with other benchmarks.
What is the top reported FrontierSWE score?
Claude Fable 5 has the top reported score on FrontierSWE: 90% (dominance score).
Why do FrontierSWE scores differ across runs?
Harness, scaffold, reasoning effort, and prompt setup change results, so two runs of the same model can differ. evals.report keeps each score with its run context so the differences stay visible.
Does evals.report rank models across benchmarks?
No. FrontierSWE scores are shown within their own metric; evals.report never combines benchmarks into a composite ranking or a single "best model".