GBA Eval
Frontier coding agents get 24 hours to write a complete Game Boy Advance emulator (Rust + WebAssembly) from scratch, graded against the Mesen2 reference emulator.
Category: Coding. Metric: overall score (higher is better).
| Model | Lab | Score | Source model | Status | Date | |
|---|---|---|---|---|---|---|
| Claude Opus 4.8 | Anthropic | 70.9% | Claude Opus 4.8 | Official | May 30, 2026 | Details |
| GPT-5.5 | OpenAI | 53.2% | GPT-5.5 | Official | May 3, 2026 | Details |
| Claude Sonnet 4.6 | Anthropic | 48.8% | Claude Sonnet 4.6 | Official | May 3, 2026 | Details |
| Claude Opus 4.6 | Anthropic | 44.1% | Claude Opus 4.6 | Official | May 2, 2026 | Details |
| Claude Opus 4.7 | Anthropic | 43.8% | Claude Opus 4.7 | Official | May 2, 2026 | Details |
| GPT-5.4 | OpenAI | 31.6% | GPT-5.4 | Official | May 12, 2026 | Details |
| Gemini 3.5 Flash | Google DeepMind | 6.7% | Gemini 3.5 Flash | Official | May 21, 2026 | Details |
| Kimi K2.6 | Moonshot AI | 0.9% | Kimi K2.6 | Official | May 2, 2026 | Details |
| Gemini 3.1 Pro Preview | Google DeepMind | 0.8% | Gemini 3.1 Pro | Official | May 3, 2026 | Details |
Each row reports the model’s overall score on GBA Eval.