How to run LLM benchmarks — step-by-step guides

SWE-bench Verified (Coding / % resolved)

SWE-bench Verified is run locally with the official `swebench` harness (Docker-based).

github.com/SWE-bench/SWE-bench (7 commands)

Terminal-Bench 2.1 (Agents / task success)

Terminal-Bench evaluates AI agents on real terminal/command-line tasks inside sandboxed Docker containers.

github.com/laude-institute/terminal-bench (10 commands)

DeepSWE (Coding / % resolved)

DeepSWE is a 113-task long-horizon SWE benchmark (TypeScript, Go, Python, JavaScript, Rust) using the Harbor task format with program-based behavioral verifiers.

github.com/datacurve-ai/deep-swe (9 commands)

GPQA Diamond (Reasoning / accuracy)

GPQA Diamond is a 198-question graduate-level science multiple-choice set (the hardest subset of the 448-question GPQA main set); the score is exact-match accuracy on the A/B/C/D answer.

github.com/idavidrein/gpqa (5 commands)
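
The exact-match scoring described for GPQA Diamond can be sketched as follows. This is a minimal illustration, not code from the GPQA repo: the function name `exact_match_accuracy` and the sample predictions are hypothetical, and it assumes each prediction has already been reduced to a single letter.

```python
def exact_match_accuracy(predictions, answers):
    """Fraction of questions whose predicted letter equals the gold letter.

    Hypothetical helper for illustration; anything outside A/B/C/D counts as wrong.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one question")

    valid = {"A", "B", "C", "D"}
    correct = 0
    for pred, gold in zip(predictions, answers):
        letter = pred.strip().upper()
        if letter in valid and letter == gold.strip().upper():
            correct += 1
    return correct / len(answers)


# Made-up sample data: two matches, one miss, one invalid letter.
preds = ["A", "c", "B", "E"]
gold = ["A", "C", "D", "B"]
print(f"accuracy = {exact_match_accuracy(preds, gold):.2f}")
```

Extracting the letter from a free-form model response is the harder part in practice and is left out here.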

LiveCodeBench Pro (Coding / Codeforces Elo)

LiveCodeBench Pro is a competitive-programming eval where your model generates C++ solutions that are judged on real testcases.

github.com/GavinZhengOI/LiveCodeBench-Pro (8 commands)

Humanity's Last Exam (Reasoning / accuracy)

Humanity's Last Exam (HLE) is run with the official centerforaisafety/hle harness. Load the gated cais/hle test set (2,500 questions, text + image), generate predictions against your own OpenAI-compatible model endpoint with run_model_predictions.py, then grade them with run_judge_results.py. The grading script uses an LLM judge (default o3-mini-2025-01-31) and reports accuracy and calibration error.

12 commands
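
To make the two reported HLE numbers concrete, here is a rough sketch of how accuracy and a binned calibration error can be computed from judged results. It is not the formula used by run_judge_results.py: the record fields `correct` and `confidence`, the function name, and the equal-width expected-calibration-error binning are all assumptions made for illustration.

```python
def accuracy_and_ece(records, n_bins=10):
    """Return (accuracy, expected calibration error) for judged records.

    Each record is a dict with a boolean 'correct' and a 'confidence' in [0, 1].
    Hypothetical schema; the official harness may store and bin results differently.
    """
    n = len(records)
    if n == 0:
        raise ValueError("need at least one record")

    accuracy = sum(1 for r in records if r["correct"]) / n

    # Group records into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for r in records:
        idx = min(int(r["confidence"] * n_bins), n_bins - 1)
        bins[idx].append(r)

    # Weighted average of |mean confidence - accuracy| over non-empty bins.
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(r["confidence"] for r in b) / len(b)
        avg_acc = sum(1 for r in b if r["correct"]) / len(b)
        ece += (len(b) / n) * abs(avg_conf - avg_acc)
    return accuracy, ece


# Made-up judged results.
sample = [
    {"correct": True, "confidence": 0.75},
    {"correct": False, "confidence": 0.75},
    {"correct": True, "confidence": 0.25},
    {"correct": False, "confidence": 0.25},
]
acc, ece = accuracy_and_ece(sample)
print(f"accuracy={acc:.2f} calibration_error={ece:.2f}")
```

A model that is right half the time but states 75% confidence is overconfident by 0.25 in that bin, which is the kind of gap a calibration metric is meant to surface.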

LiveBench (Reasoning / score)

LiveBench is a contamination-free LLM benchmark with objective ground-truth scoring across six categories (reasoning, math, coding, language, data analysis, instruction following).

github.com/LiveBench/LiveBench (12 commands)

SWE-bench Pro (Coding / % resolved)

SWE-bench Pro is run via Scale AI's official open-source harness (scaleapi/SWE-bench_Pro-os) against the public ScaleAI/SWE-bench_Pro dataset (single 731-instance test split).

github.com/scaleapi/SWE-bench_Pro-os (7 commands)

Berkeley Function Calling Leaderboard (Tool use / accuracy)

BFCL is run via the official `bfcl-eval` Python package (the berkeley-function-call-leaderboard subdirectory of ShishirPatil/gorilla).

github.com/ShishirPatil/gorilla/tree/main/berkeley-function-call-leaderboard (10 commands)

MMMU-Pro (Multimodal / accuracy)

MMMU-Pro is run via the official MMMU-Benchmark/MMMU repo's mmmu-pro/ subdirectory.

github.com/MMMU-Benchmark/MMMU (8 commands)

ARC-AGI-2 (Reasoning / accuracy)

ARC-AGI-2 ships its public eval set (120 tasks) and training set (1000 tasks) as JSON in the arcprize/ARC-AGI-2 repo, but that repo has no harness, no scoring script, and no apps/ folder.

8 commands
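
Because the ARC-AGI-2 repo ships only JSON and no scorer, a small scoring helper has to be written by hand. The sketch below assumes each task file follows the usual ARC layout (`train` and `test` lists of `input`/`output` grid pairs) and that several attempts per test output are allowed, with a test counted as solved if any attempt matches exactly. The function name and the inline task are hypothetical.

```python
import json


def score_task(task, attempts_per_test):
    """Fraction of a task's test outputs matched exactly by at least one attempt.

    task: parsed ARC task dict with a 'test' list of {'input', 'output'} pairs.
    attempts_per_test: one list of candidate grids per test pair.
    Illustrative helper only; not part of the arcprize/ARC-AGI-2 repo.
    """
    tests = task["test"]
    if len(attempts_per_test) != len(tests):
        raise ValueError("need one list of attempts per test pair")

    solved = 0
    for pair, attempts in zip(tests, attempts_per_test):
        # Grids are nested lists of ints, so == is an exact cell-by-cell match.
        if any(attempt == pair["output"] for attempt in attempts):
            solved += 1
    return solved / len(tests)


# Tiny made-up task in the assumed layout (real tasks are loaded from the repo's JSON files).
task = json.loads(
    '{"train": [], "test": ['
    '{"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},'
    '{"input": [[2]], "output": [[3]]}]}'
)
attempts = [
    [[[0, 0], [0, 0]], [[1, 0], [0, 1]]],  # second attempt is correct
    [[[2]]],                               # single wrong attempt
]
print(score_task(task, attempts))
```

Averaging this per-task score over all tasks in the evaluation set gives an overall accuracy figure.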

ARC-AGI-3 (Reasoning / accuracy)

ARC-AGI-3 is an interactive reasoning benchmark of novel grid-based games played through the ARC-AGI-3 API.

6 commands

AIME (OTIS Mock) (Reasoning / accuracy)

OTIS Mock AIME 2024-2025 is Epoch AI's 45-problem competition-math benchmark (integer answers 0-999) implemented as an inspect_ai task.

9 commands
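
Since every answer is an integer from 0 to 999, a response can be checked with a simple parse-and-compare step. The sketch below is a generic illustration and not the scorer used by the inspect_ai task: the function names are hypothetical, and it assumes the final standalone number of up to three digits in the response is the model's answer.

```python
import re


def parse_aime_answer(text):
    """Return the last standalone 1-3 digit integer in text, or None if absent.

    Longer digit runs such as 1024 are ignored because they fall outside 0-999.
    Hypothetical helper for illustration.
    """
    matches = re.findall(r"(?<!\d)\d{1,3}(?!\d)", text)
    if not matches:
        return None
    return int(matches[-1])  # leading zeros such as "073" parse to 73


def is_correct(response, gold):
    """True if the parsed answer equals the gold integer."""
    parsed = parse_aime_answer(response)
    return parsed is not None and parsed == gold


print(parse_aime_answer("The sum is 12 + 61, so the answer is 073."))
print(is_correct("Answer: 1024", 24))
```

This heuristic breaks on decimals and on responses that mention other numbers after the final answer, so a stricter answer format in the prompt is advisable.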

SimpleQA Verified (Other / accuracy)

SimpleQA Verified is a curated 1,000-question version of OpenAI's SimpleQA from Google DeepMind and Google Research, measuring short-form parametric factuality with no tools.

6 commands

GBA Eval (Coding / overall score)

GBA Eval asks a coding agent to build a Game Boy Advance emulator that compiles to a single WASM module, which the harness runs and grades frame-by-frame against a Mesen2-fork reference emulator (reference/mesen.wasm).

github.com/mechanize-work/gba-eval (14 commands)

MCP Atlas (Tool use / pass rate)

MCP Atlas runs a tool-use task suite against Dockerized MCP servers (36 servers; the README lists 307 tools, versus roughly 220 advertised) and scores model trajectories with an LLM-as-judge (gemini/gemini-2.5-pro by default) over claims-based rubrics, producing a pass rate.

github.com/scaleapi/mcp-atlas (12 commands)
