evals.report

GBA Eval

Frontier coding agents get 24 hours to write a complete Game Boy Advance emulator (Rust + WebAssembly) from scratch, graded against the Mesen2 reference emulator.

Coding: overall score (higher is better)
| Model | Lab | Score | Source model | Status | Date |
| --- | --- | --- | --- | --- | --- |
| Claude Opus 4.8 | Anthropic | 70.9% | Claude Opus 4.8 | Official | May 30, 2026 |
| GPT-5.5 | OpenAI | 53.2% | GPT-5.5 | Official | May 3, 2026 |
| Claude Sonnet 4.6 | Anthropic | 48.8% | Claude Sonnet 4.6 | Official | May 3, 2026 |
| Claude Opus 4.6 | Anthropic | 44.1% | Claude Opus 4.6 | Official | May 2, 2026 |
| Claude Opus 4.7 | Anthropic | 43.8% | Claude Opus 4.7 | Official | May 2, 2026 |
| GPT-5.4 | OpenAI | 31.6% | GPT-5.4 | Official | May 12, 2026 |
| Gemini 3.5 Flash | Google DeepMind | 6.7% | Gemini 3.5 Flash | Official | May 21, 2026 |
| Kimi K2.6 | Moonshot AI | 0.9% | Kimi K2.6 | Official | May 2, 2026 |
| Gemini 3.1 Pro Preview | Google DeepMind | 0.8% | Gemini 3.1 Pro | Official | May 3, 2026 |

Each row reports the model’s overall score on GBA Eval.