FrontierSWE
Proximal Labs' ultra-long-horizon coding-agent benchmark: 17 open-ended technical projects spanning implementation, performance engineering, and applied ML research (e.g. optimizing a real compiler, inventing better ML optimizers, building a PostgreSQL-compatible server backed by SQLite). Agents get up to 20 hours per task and 5 trials on each task. Tasks are graded 0–1 on partial progress, and frontier models barely make headway, which makes FrontierSWE one of the few unsaturated public coding benchmarks. Models are ranked by 'dominance' (win rate against a random opponent across tasks).
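The description does not spell out how 'dominance' is computed, so the following is only a minimal sketch of one plausible reading: for each model, average over every opponent and every shared task the fraction of trial pairings in which the model's score beats the opponent's, counting ties as half a win. The function name `dominance`, the input layout, the tie rule, and the model and task names are all assumptions made for illustration and are not taken from the benchmark itself.

```python
from itertools import product


def dominance(scores):
    """Hypothetical dominance score: mean win rate against other models.

    scores maps model -> task -> list of trial scores in [0, 1].
    Returns model -> dominance in [0, 1]. Ties count as half a win
    (an assumption; the benchmark's actual tie rule is not stated).
    """
    result = {}
    for model, model_tasks in scores.items():
        matchup_rates = []
        for opponent, opponent_tasks in scores.items():
            if opponent == model:
                continue
            for task, own_trials in model_tasks.items():
                other_trials = opponent_tasks.get(task)
                if not own_trials or not other_trials:
                    continue  # skip tasks the two models do not share
                # Compare every trial of this model with every trial of the opponent.
                pairs = list(product(own_trials, other_trials))
                wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a, b in pairs)
                matchup_rates.append(wins / len(pairs))
        result[model] = sum(matchup_rates) / len(matchup_rates) if matchup_rates else 0.0
    return result


# Made-up scores for two hypothetical models on two hypothetical tasks.
example = {
    "model_a": {"task_1": [0.2, 0.4], "task_2": [0.0]},
    "model_b": {"task_1": [0.1, 0.3], "task_2": [0.0]},
}
ranking = sorted(dominance(example).items(), key=lambda kv: kv[1], reverse=True)
```

Because the score depends only on pairwise comparisons within a task, it stays meaningful even when absolute task scores are close to zero, which fits a benchmark where models make little headway.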
Category: Agents. Metric: dominance score (higher is better).
No run guide for this benchmark yet.