evals.report
Run guides
01
SWE-bench Verified
Coding / % resolved

Run the verified SWE-bench split with a fixed agent scaffold, repository setup, and scoring harness.

github.com/SWE-bench/SWE-bench
7 commands
02
GPQA Diamond
Reasoning / accuracy

Evaluate multiple-choice science questions from the GPQA Diamond subset with a fixed prompt and answer extractor.

github.com/idavidrein/gpqa
7 commands
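
As an illustration of what "a fixed prompt and answer extractor" can mean for a multiple-choice benchmark like GPQA Diamond, here is a minimal sketch. It is not the extractor from the GPQA repository: the `Answer: <letter>` output convention, the regular expression, and the helper names `extract_choice` and `accuracy` are all assumptions made for this example.

```python
import re
from typing import Optional, Sequence

# Assumption: the fixed prompt tells the model to end with a line such as
# "Answer: B". The keyword is matched case-insensitively; the letter must be
# an uppercase A-D that is not the start of a longer word.
ANSWER_RE = re.compile(r"(?i:answer)\s*:\s*\(?([A-D])\)?(?![A-Za-z])")


def extract_choice(completion: str) -> Optional[str]:
    """Return the last 'Answer: X' letter in the completion, or None."""
    matches = ANSWER_RE.findall(completion)
    return matches[-1] if matches else None


def accuracy(completions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of completions whose extracted letter equals the gold letter.

    A completion with no extractable answer counts as incorrect.
    """
    if len(completions) != len(gold):
        raise ValueError("completions and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(extract_choice(c) == g for c, g in zip(completions, gold))
    return correct / len(gold)
```

Taking the last match lets a model revise its answer mid-response, and scoring unparseable responses as wrong keeps the metric from rewarding outputs that ignore the required format. Both are design choices for this sketch, so scores are only comparable across runs that hold the prompt and extractor fixed.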
03
LiveCodeBench Pro
Coding / Codeforces Elo

Problems and tooling are published; ratings are computed from live Codeforces-style contests.

github.com/GavinZhengOI/LiveCodeBench-Pro
source instructions
04
Berkeley Function Calling Leaderboard
Tool use / accuracy

Official BFCL README documents install, generation, evaluation, and score output.

github.com/ShishirPatil/gorilla/tree/main/berkeley-function-call-leaderboard
source instructions
05
LiveBench
Reasoning / score

Official repo includes run_livebench.py, scoring utilities, and download_leaderboard.py.

github.com/LiveBench/LiveBench
source instructions
06
Terminal-Bench 2.1
Agents / task success

Official repo includes tasks, Docker setup, adapters, and registry.

github.com/laude-institute/terminal-bench
source instructions
07
SWE-bench Pro
Coding / % resolved

Repo includes harness scripts, Dockerfiles, and run scripts.

github.com/scaleapi/SWE-bench_Pro-os
source instructions
08
DeepSWE
Coding / % resolved

Official guide documents Pier/Harbor-compatible execution with mini-swe-agent, subsets, single-task runs, and submission.

github.com/datacurve-ai/deep-swe
source instructions
09
Humanity's Last Exam
Reasoning / accuracy

Dataset/eval access is public enough to document, but official run details vary.

source instructions
10
MMMU-Pro
Multimodal / accuracy

Repo has evaluation scripts and prompts for MMMU-Pro.

github.com/MMMU-Benchmark/MMMU
source instructions
11
ARC-AGI-3
Reasoning / accuracy

Dataset/task execution is documented, but frontier submissions are competition-style.

source instructions
12
ARC-AGI-2
Reasoning / accuracy

Tasks and evaluation are public; frontier scores are ARC-Prize-verified.

source instructions
13
AIME (OTIS Mock)
Reasoning / accuracy

Problems and methodology are documented on the Epoch AI benchmarks hub.

source instructions
14
SimpleQA Verified
Other / accuracy

Problems and methodology are documented on the Epoch AI benchmarks hub.

source instructions