evals.report
In the wild
Official benchmarks show reported scores. In-the-wild reports show what users hit after release: latency, cost, quota, regressions, surprising wins, and task-specific failures. These are source-linked anecdotes, not benchmark scores.

Prem · on Claude Opus 4.8
X·@btrmasaladosa·
Positive
i was stuck on a landing page redesign with gpt 5.5 and opus 4.6 since a couple of days. gave a fresh try with opus 4.8 and it one shotted what i was looking for

Task: A landing-page redesign GPT-5.5 and Opus 4.6 hadn't cracked over several days.

anecdotal
Mark Santos · on Claude Opus 4.8
X·@markksantos·
Mixed
I think in terms of application functionality, quality, and design, Opus won … got stuck twice in a retry loop (had to prompt to self-correct).

Task: Head-to-head vs Sakana Fugu Ultra: a single-file Three.js Crossy Road game.

prompt shown · single run
Igor Kotenkov · on Gemini 3 Pro
X·@stalkermustang·
Mixed
It is great at writing - i'm using it to this day. It was good in one-shotting front-end. But agentic? … in my memory it was never a catch up in the most important and money making areas
anecdotal · high-signal user
Field test · Comparison · Writing · Coding · Agents
AIAcademy · on Fugu Ultra
X·@AIAcademykorea·
Mixed
The 5-hour limit has been exceeded, so I have to wait 4 hours. However, it kindly provides guidance … I like this one better because it is user-oriented, offering friendly guidance for beginners and general users.
anecdotal
AM9:21 · on Fugu Ultra
X·@AM921543266·
Positive
It discovered 27 bugs that Fable 5 couldn't find and fixed all of them. The code quality is impeccable … it implemented about 70,000 lines of new features, resolved 4 issues, and created 7 PRs.

Task: Introduced Fugu into a repo previously worked on with Claude Fable 5; ~1 hour of use.

anecdotal · output shown
am.will · on Fugu Ultra
X·@LLMJunky·
Negative
The game was pretty bad and notably worse than GPT 5.5. … GPT 5.5 by contrast did a pretty good job and required no follow ups.

Task: Asked it to build a Three.js replica of Rocket League via Codex.

anecdotal · prompt shown · single run
Field test · Quota · Comparison · Coding · Agents
Mark Santos · on Fugu Ultra
X·@markksantos·
Mixed
In terms of model speed and performance, Fugu on Opencode won … inverted directional turn, wonky camera, no sfx, not identical to Crossy Road game.

Task: Head-to-head vs Claude Opus 4.8: a single-file Three.js Crossy Road game.

prompt shown · single run
Pranav Sriram · on GLM-5.2
X·@PranavSriram1·
Negative
For my research, Fable felt like a clear step change … I was excited about the GLM 5.2 hype and tried it; sadly it's nowhere close

Task: Evaluating models for research work (alongside Fable and GPT-5.5 Pro).

anecdotal
Machina · on Claude Opus 4.8
X·@EXM7777·
Mixed
Opus 4.8 in the last 48hrs is amazing … it's just very sad to go from godlike performance to barely usable some days.
anecdotal
@ceo_tommy1 · on GPT-5.5 Pro
X·@ceo_tommy1·
Positive
It's way too convenient to make Codex handle GPT5.5Pro work, and it makes my tasks infinitely more productive.

Task: Using GPT-5.5 Pro from the Codex CLI for day-to-day work.

anecdotal · paid user
@Hesamation · on GLM-5.2
X·@Hesamation·
Positive
GLM 5.2 ranks unusually high on FrontierSWE (long-horizon agentic engineering) … using it with OpenCode is also not far from the quality of Claude Code or Codex.

Task: Day-to-day agentic coding with GLM-5.2 in OpenCode.

anecdotal · high-signal user
Benchmark reproduction · Field test · Coding · Agents
Guillermo Rauch · on GLM-5.2
X·@rauchg·
Positive
Genuinely impressed, almost shocked, at how good GLM-5.2 … is at coding. This changes things.
anecdotal · high-signal user
Theo · on GLM-5.2
X·@theo·
Mixed
Having an open weight model surpass GPT-5.4 and every Gemini model is dope. That said - it's not cheap. Both Opus 4.8 and GPT-5.5 set to "medium" are cheaper and smarter than GLM-5.2
anecdotal · high-signal user
@spoobsV1 · on Claude Fable 5
X·@spoobsV1·
Positive
Wow Claude Fable 5 is insane!! It just recreated the 2011 game of the year … The Elder Scrolls V: Skyrim in ONE prompt.

Task: The entire prompt was: make skyrim.

anecdotal · prompt shown
elvis · on DeepSeek V4 Pro
X·@omarsar0·
Positive
I have been testing DeepSeek-V4-Pro with the Pi coding agent. I am mindblown by how well it works out of the box.

Task: Built an LLM wiki with an agent powered entirely by DeepSeek-V4-Pro.

anecdotal · high-signal user