evals.report

FrontierSWE


Agents | dominance score | Higher is better

What is FrontierSWE?

FrontierSWE is Proximal Labs' ultra-long-horizon coding-agent benchmark. It consists of 17 open-ended technical projects spanning implementation, performance engineering, and applied ML research, for example optimizing a real compiler, inventing better ML optimizers, and building a PostgreSQL-compatible server backed by SQLite. Agents get up to 20 hours per task and 5 trials per task. Each task is graded from 0 to 1 with credit for partial progress, and frontier models barely make headway, which makes FrontierSWE one of the few unsaturated public coding benchmarks. Models are ranked by 'dominance': the win rate against a random opponent across tasks.

evals.report tracks reported FrontierSWE scores with the model, source, status, date, and run caveats attached. It covers official leaderboard scores, vendor-reported launches, and clearly labeled community runs.
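The description of dominance as a win rate against a random opponent across tasks can be made concrete with a small sketch. The exact formula Proximal Labs uses is not given here, so the version below is one plausible reading, not the official scoring code. It assumes that each model has a single 0–1 score per task (for instance a mean over its trials), that opponents and tasks are drawn uniformly, and that ties count as half a win. The function name `dominance`, the model names, the task names, and all scores are hypothetical.

```python
from statistics import mean


def dominance(scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """Win rate of each model against every other model, pooled over tasks.

    scores[model][task] is a 0-1 task score. ASSUMPTIONS (not from the
    benchmark's documentation): uniform choice of opponent and task,
    ties count as 0.5, and only tasks that both models ran are compared.
    """
    result = {}
    for model, own in scores.items():
        outcomes = []
        for rival, theirs in scores.items():
            if rival == model:
                continue
            # Compare the two models on every task they have in common.
            for task in own.keys() & theirs.keys():
                a, b = own[task], theirs[task]
                outcomes.append(1.0 if a > b else 0.5 if a == b else 0.0)
        result[model] = mean(outcomes) if outcomes else float("nan")
    return result


# Hypothetical per-task scores; every absolute score is low.
example = {
    "model_a": {"compiler": 0.40, "optimizer": 0.10, "pg_server": 0.30},
    "model_b": {"compiler": 0.20, "optimizer": 0.10, "pg_server": 0.05},
    "model_c": {"compiler": 0.00, "optimizer": 0.00, "pg_server": 0.10},
}

if __name__ == "__main__":
    for name, value in dominance(example).items():
        print(f"{name}: {value:.0%}")
```

Under these assumptions, dominance is a relative measure: `model_a` wins most of its pairwise comparisons even though none of its task scores exceeds 0.40. This is consistent with a leaderboard where the top dominance score is high while absolute progress on the tasks remains limited.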

Top reported FrontierSWE score: Claude Fable 5 at 90% (dominance score).

| Model | Lab | Score | Source model | Status | Date |
| --- | --- | --- | --- | --- | --- |
| Claude Fable 5 | Anthropic | 90% | Claude Fable 5 (Claude Code) | Official | Jun 9, 2026 |
| Claude Opus 4.8 | Anthropic | 75% | Claude Opus 4.8 (Claude Code) | Official | May 28, 2026 |
| GLM-5.2 | Z.ai | 74% | GLM-5.2 (Claude Code) | Official | Jun 16, 2026 |
| GPT-5.5 | OpenAI | 73% | GPT-5.5 (Codex) | Official | Apr 23, 2026 |
| Claude Opus 4.7 | Anthropic | 63% | Claude Opus 4.7 (Claude Code) | Official | Apr 16, 2026 |
| Claude Opus 4.6 | Anthropic | 56% | Claude Opus 4.6 (Claude Code) | Official | Feb 5, 2026 |
| GPT-5.4 | OpenAI | 54% | GPT-5.4 (Codex) | Official | Mar 5, 2026 |
| Gemini 3.1 Pro Preview | Google DeepMind | 40% | Gemini 3.1 Pro (Gemini CLI) | Official | Feb 19, 2026 |
| GLM-5.1 | Z.ai | 31% | GLM-5.1 (Claude Code) | Official | Apr 7, 2026 |
| DeepSeek V4 Pro | DeepSeek | 29% | DeepSeek V4 Pro (Claude Code) | Official | Apr 24, 2026 |
| Kimi K2.6 | Moonshot AI | 27% | Kimi K2.6 (Kimi CLI) | Official | Apr 20, 2026 |
| Kimi K2.5 | Moonshot AI | 26% | Kimi K2.5 (Kimi CLI) | Official | Jan 27, 2026 |
| Qwen 3.6 Plus | Alibaba / Qwen | 22% | Qwen3.6-Plus (Qwen Code) | Official | Apr 2, 2026 |

Each row reports the model’s dominance score on FrontierSWE.