Run GBA Eval
The same run guide is also available from the benchmark detail page.
GBA Eval asks a coding agent to build a Game Boy Advance emulator that compiles to a single WASM module. The harness runs that module and grades it frame by frame against a Mesen2-fork reference emulator (reference/mesen.wasm). You build the candidate WASM inside the provided Docker task container (or natively), then run the Rust `grader` crate, which emits per-test JSON, PNG frame diffs, and a summary.json with the overall score (0.60 replay + 0.20 procedural + 0.20 audio). Scoring is entirely local. Whenever you report a score, attach the harness commit, the corpus version, whether you graded via Docker (the default) or with --native, and the fact that the reference binary was reference/mesen.wasm.
1. Install
git clone https://github.com/mechanize-work/gba-eval.git
git -C gba-eval submodule update --init --recursive
git -C gba-eval lfs install && git -C gba-eval lfs pull
cp gba-eval/quickstart/.env.example gba-eval/quickstart/.env
cd gba-eval && ./quickstart/smoke-test.sh
2. Run evaluation
# From the repo root, bring up the task container (compose file lives in quickstart/):
cd quickstart && docker compose up -d && cd ..
# Build the GBA emulator WASM inside the container (task is /task/TASK.md):
./quickstart/shell.sh cargo build --release --lib --target wasm32-unknown-unknown
# (Optional) drive an agent in-container, e.g. Claude Code (install first, then run):
./quickstart/shell.sh bash -lc 'curl -fsSL https://claude.ai/install.sh | bash'
./quickstart/shell.sh claude --task /task/TASK.md
3. Score output
# Grade the WASM built inside the container against the bundled Mesen2 reference (Docker, default):
./quickstart/grade.sh --from-container my-run
# OR grade the in-repo baseline candidate (run-name 'baseline'):
./quickstart/grade.sh candidates/gba-core/gba_core_shim.wasm baseline
# OR grade natively without Docker (requires local Rust/cargo + clang + cmake):
./quickstart/grade.sh --native path/to/my.wasm my-run
# OR invoke the Rust grader crate directly (precompute reference frames first, then grade):
cargo run -p grader --release -- --precompute corpus/
cargo run -p grader --release -- candidate.wasm corpus/ results/my-run/
4. Expected output
Results land in ./results/<run-name>/ as per-testcase JSON files, PNG frame diffs, and a summary.json with the overall score. The overall score is overall = 0.60 × replay + 0.20 × procedural + 0.20 × audio (Gameplay Replays 60%, Procedural Tests 20%, Audio 20%); the weights are configurable in corpus/grader.yaml. The bundled baseline candidate (candidates/gba-core/gba_core_shim.wasm) is an intentionally partial implementation that scores roughly 0.53 overall. This score is specific to GBA Eval (frame accuracy against a Mesen2 fork) and must not be compared against scores from other coding benchmarks.
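As a sanity check on the weighting, the overall score can be recomputed by hand from the three section sub-scores. The sketch below assumes the default weights quoted above (adjust them if corpus/grader.yaml has been changed); `overall_score` is a hypothetical helper, not something shipped in the repo, and the sub-scores passed to it are made-up values for illustration.

```shell
#!/usr/bin/env bash
# Sketch: recompute the overall score from the three section sub-scores.
# Assumes the default weights 0.60 / 0.20 / 0.20 described in this guide.
# overall_score is a hypothetical helper name, not part of gba-eval.
overall_score() {
  local replay="$1" procedural="$2" audio="$3"
  awk -v r="$replay" -v p="$procedural" -v a="$audio" \
    'BEGIN { printf "%.4f\n", 0.60 * r + 0.20 * p + 0.20 * a }'
}

# Illustrative (invented) sub-scores: replay 0.40, procedural 0.80, audio 0.50.
# 0.60*0.40 + 0.20*0.80 + 0.20*0.50 = 0.24 + 0.16 + 0.10
overall_score 0.40 0.80 0.50   # prints 0.5000
```

Because replay carries 60% of the weight, a candidate that does well on the procedural and audio sections alone is still capped at 0.40 overall.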
5. Submit results
Scoring is fully local: the grader writes results/<run>/summary.json with the overall score and the replay/procedural/audio sub-scores. There is no public self-serve submission flow in the repo (CONTACT.md only lists a legal-inquiry address, stephen@mechanize.work, and gbaeval.com presents results without a submission form). Report the overall score together with the three section sub-scores, the gba-eval harness commit, the corpus version, the reference binary (reference/mesen.wasm), and whether you graded via Docker (default) or --native. The repo itself is the canonical reproducible harness; reproduce locally rather than relying on a leaderboard.
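One way to keep the required provenance next to a score is to print it in a fixed format alongside the path to summary.json. The following is a minimal sketch under stated assumptions: `print_run_report` and its output field names are hypothetical, and the corpus version is passed in by hand because this guide does not say where it is recorded. Only `git rev-parse HEAD` is used to look anything up.

```shell
#!/usr/bin/env bash
# Sketch: print the provenance fields this guide asks to report with a score.
# print_run_report and the field names are hypothetical, not part of gba-eval.
print_run_report() {
  local repo_dir="$1" run_name="$2" corpus_version="$3" mode="$4"
  local commit
  # Harness commit; falls back to "unknown" outside a git checkout.
  commit="$(git -C "$repo_dir" rev-parse HEAD 2>/dev/null)" || commit="unknown"
  echo "harness_commit: $commit"
  echo "corpus_version: $corpus_version"   # supplied by the caller (assumption)
  echo "grading_mode: $mode"               # "docker" (default) or "native"
  echo "reference_binary: reference/mesen.wasm"
  echo "summary: results/$run_name/summary.json"
}

# Example: run from the repo root after grading a run named my-run via Docker.
print_run_report . my-run "corpus-version-here" docker
```

The overall score and the three sub-scores themselves should be copied from results/<run>/summary.json; the sketch deliberately does not parse that file, since its exact field names are not described here.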