evals.report
In the wild
Official benchmarks show reported scores. In-the-wild reports show what users hit after release: latency, cost, quota, regressions, surprising wins, and task-specific failures. These are source-linked anecdotes, not benchmark scores.
Model: Fugu Ultra (Sakana AI)

AIAcademy · on Fugu Ultra
X · @AIAcademykorea
Mixed
The 5-hour limit has been exceeded, so I have to wait 4 hours. However, it kindly provides guidance … I like this one better because it is user-oriented, offering friendly guidance for beginners and general users.
anecdotal
AM9:21 · on Fugu Ultra
X · @AM921543266
Positive
It discovered 27 bugs that Fable 5 couldn't find and fixed all of them. The code quality is impeccable … it implemented about 70,000 lines of new features, resolved 4 issues, and created 7 PRs.

Task: Introduced Fugu into a repo previously worked on with Claude Fable 5; ~1 hour of use.

anecdotal · output shown
am.will · on Fugu Ultra
X · @LLMJunky
Negative
The game was pretty bad and notably worse than GPT 5.5. … GPT 5.5 by contrast did a pretty good job and required no follow ups.

Task: Asked it to build a Three.js replica of Rocket League via Codex.

anecdotal · prompt shown · single run
Field test · Quota · Comparison · Coding · Agents
Mark Santos · on Fugu Ultra
X · @markksantos
Mixed
In terms of model speed and performance, Fugu on Opencode won … inverted directional turn, wonky camera, no sfx, not identical to Crossy Road game.

Task: Head-to-head vs Claude Opus 4.8: a single-file Three.js Crossy Road game.

prompt shown · single run