FrontierSWE
Proximal Labs' ultra-long-horizon coding-agent benchmark: 17 open-ended technical projects spanning implementation, performance engineering, and applied ML research (e.g. optimizing a real compiler, inventing better ML optimizers, building a PostgreSQL-compatible server backed by SQLite). Agents get up to 20 hours per task and 5 trials each; tasks are graded 0–1 on partial progress, and frontier models barely make headway — making FrontierSWE one of the few unsaturated public coding benchmarks. Models are ranked by 'dominance' (win rate against a random opponent across tasks).
What is FrontierSWE?
Proximal Labs' ultra-long-horizon coding-agent benchmark: 17 open-ended technical projects spanning implementation, performance engineering, and applied ML research (e.g. optimizing a real compiler, inventing better ML optimizers, building a PostgreSQL-compatible server backed by SQLite). Agents get up to 20 hours per task and 5 trials each; tasks are graded 0–1 on partial progress, and frontier models barely make headway — making FrontierSWE one of the few unsaturated public coding benchmarks. Models are ranked by 'dominance' (win rate against a random opponent across tasks). evals.report tracks reported FrontierSWE scores with the model, source, status, date, and run caveats attached — official leaderboard scores, vendor-reported launches, and clearly labeled community runs.
Top reported FrontierSWE score: Claude Fable 5 — 90% (dominance score).
| Model | Lab | Score↓ | Source model | Status | Date | |
|---|---|---|---|---|---|---|
| Claude Fable 5 | Anthropic | 90% | Claude Fable 5 (Claude Code) | Official | Jun 9, 2026 | Details |
| Claude Opus 4.8 | Anthropic | 75% | Claude Opus 4.8 (Claude Code) | Official | May 28, 2026 | Details |
| GLM-5.2 | Z.ai | 74% | GLM-5.2 (Claude Code) | Official | Jun 16, 2026 | Details |
| GPT-5.5 | OpenAI | 73% | GPT-5.5 (Codex) | Official | Apr 23, 2026 | Details |
| Claude Opus 4.7 | Anthropic | 63% | Claude Opus 4.7 (Claude Code) | Official | Apr 16, 2026 | Details |
| Claude Opus 4.6 | Anthropic | 56% | Claude Opus 4.6 (Claude Code) | Official | Feb 5, 2026 | Details |
| GPT-5.4 | OpenAI | 54% | GPT-5.4 (Codex) | Official | Mar 5, 2026 | Details |
| Gemini 3.1 Pro Preview | Google DeepMind | 40% | Gemini 3.1 Pro (Gemini CLI) | Official | Feb 19, 2026 | Details |
| GLM-5.1 | Z.ai | 31% | GLM-5.1 (Claude Code) | Official | Apr 7, 2026 | Details |
| DeepSeek V4 Pro | DeepSeek | 29% | DeepSeek V4 Pro (Claude Code) | Official | Apr 24, 2026 | Details |
| Kimi K2.6 | Moonshot AI | 27% | Kimi K2.6 (Kimi CLI) | Official | Apr 20, 2026 | Details |
| Kimi K2.5 | Moonshot AI | 26% | Kimi K2.5 (Kimi CLI) | Official | Jan 27, 2026 | Details |
| Qwen 3.6 Plus | Alibaba / Qwen | 22% | Qwen3.6-Plus (Qwen Code) | Official | Apr 2, 2026 | Details |
Each row reports the model’s dominance score on FrontierSWE. Click a row for the full run context.