Run guides
Step-by-step instructions for running official benchmark harnesses against your own models. Each guide includes the relevant repo, dataset, and run context.
Run the verified SWE-bench split with a fixed agent scaffold, repository setup, and scoring harness.
Evaluate multiple-choice science questions from the GPQA Diamond subset with a fixed prompt and answer extractor.
Problems and tooling are published; ratings are computed from live Codeforces-style contests.
Official BFCL README documents install, generation, evaluation, and score output.
Official repo includes run_livebench.py, scoring utilities, and download_leaderboard.py.
Official repo includes tasks, Docker setup, adapters, and registry.
Repo includes harness scripts, Dockerfiles, and run scripts.
Official guide documents Pier/Harbor-compatible execution with mini-swe-agent, subsets, single-task runs, and submission.
The dataset and evaluation code are public enough to document, but official run details vary.
Repo has evaluation scripts and prompts for MMMU-Pro.
Dataset/task execution is documented, but frontier submissions are competition-style.
Tasks and evaluation are public; frontier scores are ARC-Prize-verified.
Problems and methodology are documented on the Epoch AI benchmarks hub.
Problems and methodology are documented on the Epoch AI benchmarks hub.
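As an illustration of the "fixed prompt and answer extractor" setup mentioned for the GPQA Diamond guide, here is a minimal sketch of what such an extractor could look like. It assumes the prompt instructs the model to finish with a line such as "Answer: B"; the regex, the function names `extract_choice` and `accuracy`, and the four-letter option set are hypothetical and are not taken from any official harness.

```python
import re
from typing import Optional, Sequence

# Assumption: the fixed prompt asks the model to end with "Answer: <letter>".
# The pattern is illustrative, not the one used by any official harness.
ANSWER_RE = re.compile(r"answer\s*:\s*\(?([A-D])\b", re.IGNORECASE)


def extract_choice(response: str) -> Optional[str]:
    """Return the last stated choice (A-D) in a response, or None if absent."""
    matches = ANSWER_RE.findall(response)
    return matches[-1].upper() if matches else None


def accuracy(responses: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of responses whose extracted choice equals the gold letter."""
    if len(responses) != len(gold):
        raise ValueError("responses and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(extract_choice(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)
```

Taking the last match rather than the first lets a model revise its answer mid-response, and a response with no parseable answer is simply scored as incorrect. Keeping both the prompt and the extractor fixed across models is what makes the resulting scores comparable.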