evals.report

GBA Eval asks a coding agent to build a Game Boy Advance emulator that compiles to a single WASM module. The harness runs that module and grades it frame-by-frame against a Mesen2-fork reference emulator (reference/mesen.wasm). You build the candidate WASM inside the provided Docker task container (or natively), then run the Rust `grader` crate, which emits per-test JSON, PNG frame diffs, and a summary.json with the overall score (0.60 replay + 0.20 procedural + 0.20 audio). Scoring is entirely local, so report every score together with the harness commit, the corpus version, the grading mode (Docker, the default, or --native), and the reference binary used (reference/mesen.wasm).

Benchmark: GBA Eval
Dataset: github.com/mechanize-work/gba-eval
Metric: overall score

1. Install

shell
git clone https://github.com/mechanize-work/gba-eval.git
shell
git -C gba-eval submodule update --init --recursive
shell
git -C gba-eval lfs install && git -C gba-eval lfs pull
shell
cp gba-eval/quickstart/.env.example gba-eval/quickstart/.env
shell
cd gba-eval && ./quickstart/smoke-test.sh

2. Run evaluation

shell
# From the repo root, bring up the task container (compose file lives in quickstart/):
cd quickstart && docker compose up -d && cd ..
shell
# Build the GBA emulator WASM inside the container (task is /task/TASK.md):
./quickstart/shell.sh cargo build --release --lib --target wasm32-unknown-unknown
shell
# (Optional) drive an agent in-container, e.g. Claude Code (install first, then run):
./quickstart/shell.sh bash -lc 'curl -fsSL https://claude.ai/install.sh | bash'
shell
./quickstart/shell.sh claude --task /task/TASK.md
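Before grading, it can help to confirm that the build actually produced a WebAssembly binary. The following is a minimal sketch, not part of the harness: `is_wasm` is a hypothetical helper that checks only the four-byte WebAssembly magic number (`00 61 73 6d`), and it assumes the file is reachable from wherever the shell runs.

```shell
# Sketch (hypothetical helper, not shipped with gba-eval):
# succeed only if the file starts with the WebAssembly magic bytes 00 61 73 6d.
is_wasm() {
  # od prints the first four bytes as hex; tr strips the spaces and newline od adds.
  magic=$(head -c 4 "$1" 2>/dev/null | od -An -tx1 | tr -d ' \n')
  [ "$magic" = "0061736d" ]
}

# Example use (the path is whatever your build emitted):
#   is_wasm path/to/my.wasm && echo "looks like WASM"
```

This check says nothing about whether the module exports what the grader expects. It only catches empty files, text files, and failed builds early.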

3. Score output

shell
# Grade the WASM built inside the container against the bundled Mesen2 reference (Docker, default):
./quickstart/grade.sh --from-container my-run
shell
# OR grade the in-repo baseline candidate (run-name 'baseline'):
./quickstart/grade.sh candidates/gba-core/gba_core_shim.wasm baseline
shell
# OR grade natively without Docker (requires local Rust/cargo + clang + cmake):
./quickstart/grade.sh --native path/to/my.wasm my-run
shell
# OR invoke the Rust grader crate directly (precompute reference frames first, then grade):
cargo run -p grader --release -- --precompute corpus/
shell
cargo run -p grader --release -- candidate.wasm corpus/ results/my-run/

4. Expected output

Results land in ./results/<run-name>/ as per-testcase JSON files, PNG frame diffs, and a summary.json containing the overall score. The overall score is a weighted sum: overall = 0.60 × replay + 0.20 × procedural + 0.20 × audio (Gameplay Replays 60%, Procedural Tests 20%, Audio 20%). The weights are configurable in corpus/grader.yaml. The bundled baseline candidate (candidates/gba-core/gba_core_shim.wasm) is an intentionally partial implementation that scores about 0.53 overall. The score is specific to GBA Eval (frame accuracy against a Mesen2 fork) and must not be compared with scores from other coding benchmarks.
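The weighted sum can be reproduced by hand to sanity-check a summary. The sketch below hard-codes the default 0.60 / 0.20 / 0.20 weights stated above; `overall_score` is a hypothetical helper, and the inputs in the example are made-up values, not real sub-scores from any run.

```shell
# Sketch (hypothetical helper): combine the three section scores
# using the default weights. Adjust if corpus/grader.yaml was changed.
overall_score() {
  # $1 = replay, $2 = procedural, $3 = audio, each expected in [0, 1]
  awk -v r="$1" -v p="$2" -v a="$3" \
    'BEGIN { printf "%.4f\n", 0.60 * r + 0.20 * p + 0.20 * a }'
}

# Example with illustrative inputs:
#   overall_score 1 0 0    # replay perfect, the other sections zero
```

If the number computed this way disagrees with summary.json, the weights in corpus/grader.yaml are the first thing to check.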

5. Submit results

Scoring is fully local: the grader writes results/<run-name>/summary.json with the overall score and the replay, procedural, and audio sub-scores. The repo has no public self-serve submission flow (CONTACT.md lists only a legal-inquiry address, stephen@mechanize.work, and gbaeval.com presents results without a submission form). Report the overall score together with the three section sub-scores, the gba-eval harness commit, the corpus version, the reference binary (reference/mesen.wasm), and whether you graded via Docker (default) or --native. The repo itself is the canonical reproducible harness; reproduce locally rather than relying on a leaderboard.
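One way to keep these details attached to a score is to emit a single provenance line per run. This is only a sketch: `provenance_line` and its key names are hypothetical, every value is passed in by the caller, and nothing here reads summary.json, whose field names are not described in this guide.

```shell
# Sketch (hypothetical helper): print one line with everything that should accompany a score.
# Arguments: overall, replay, procedural, audio, harness commit, corpus version, grading mode.
provenance_line() {
  printf 'overall=%s replay=%s procedural=%s audio=%s commit=%s corpus=%s mode=%s reference=%s\n' \
    "$1" "$2" "$3" "$4" "$5" "$6" "$7" "reference/mesen.wasm"
}

# Example use; the scores and corpus version are placeholders you fill in by hand:
#   provenance_line "$overall" "$replay" "$procedural" "$audio" \
#     "$(git -C gba-eval rev-parse HEAD)" "$corpus_version" docker
```

The harness commit can be read with `git rev-parse HEAD` as shown in the comment. Where the corpus version is recorded is not stated in this guide, so the sketch leaves it to the caller.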

Gotchas

The reference emulator (reference/mesen.wasm) and the corpus/reference-cache/ assets are stored via Git LFS, and the Mesen2 fork is a submodule. You MUST run `git lfs install && git lfs pull` (`git submodule update --init --recursive` is only needed if you rebuild the reference). Otherwise grading fails because the LFS-tracked files are still text pointers beginning with 'version https://git-lfs'; smoke-test.sh explicitly detects and rejects these pointers.
grade.sh defaults to Docker; --from-container copies the WASM from /task/target/wasm32-unknown-unknown/release/gba_emu.wasm inside the task container, so your build must emit exactly that artifact. Use --native (requires local Rust/cargo + clang + cmake) only to skip Docker.
smoke-test.sh is only a preflight that takes about 30 seconds: it checks submodules, LFS, the baseline candidate, and the grader image, but does NOT run a real grade or precompute references. When invoking the grader crate directly, run `cargo run -p grader --release -- --precompute corpus/` before grading a candidate so that reference frames exist.
The headline number is the weighted overall score (replay 60% / procedural 20% / audio 20%) defined in corpus/grader.yaml. Do not report the replay sub-score as the headline, and re-grade with the same corpus version when comparing runs.
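The LFS gotcha above can be checked by hand. The sketch below is independent of smoke-test.sh and does not claim to match how that script detects pointers. It relies only on the fact that a Git LFS pointer is a small text file whose first line starts with `version https://git-lfs`; `is_lfs_pointer` is a hypothetical helper name.

```shell
# Sketch (hypothetical helper): succeed if the file is still a Git LFS pointer
# rather than the real binary it stands in for.
is_lfs_pointer() {
  # Only the first bytes matter; LC_ALL=C keeps grep byte-oriented on binary input.
  head -c 64 "$1" 2>/dev/null | LC_ALL=C grep -q '^version https://git-lfs'
}

# Example use from the repo root:
#   is_lfs_pointer reference/mesen.wasm && echo "run: git lfs pull"
```

If the check succeeds for reference/mesen.wasm, the file has not been fetched yet and `git lfs pull` has to run before any grading.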