evals.report
Ready now · Raw JSON · Structured data · Run guide ready · Machine-readable

SWE-bench Verified

Canonical software-engineering agent benchmark already in product scope.

Category
Coding
Owner
SWE-bench
Data path
Official leaderboard rows and per-instance metadata can be shown with scaffold and tool context preserved.
View source Official
Next · Raw JSON · Structured data · Partial run guide · Public data

LiveCodeBench Pro

The actively updated 2026 successor to LiveCodeBench; rates frontier models on live competitive-programming contests.

Category
Coding
Owner
LiveCodeBench Pro
Data path
Use the Elo ratings from the official leaderboard; the metric is an Elo rating, not a percentage.
View source Official
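Because the LiveCodeBench Pro metric is an Elo rating, what carries meaning is the gap between two ratings, not the absolute number. A minimal sketch, assuming the standard logistic Elo formula with a 400-point scale; the leaderboard's exact rating procedure is not described here and may differ, and `elo_expected_score` is a hypothetical helper name:

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard Elo model.

    Assumption: the classic logistic curve with a 400-point scale.
    This is illustrative and not taken from the leaderboard's own code.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Equal ratings give an even expectation, regardless of the absolute rating.
print(elo_expected_score(1500, 1500))  # 0.5
```

Under this assumption, a rating cannot be read as "percent of problems solved"; it only makes sense relative to another rating.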
Ready now · Result archive · Review needed · Run guide ready · Public data

Berkeley Function Calling Leaderboard

Strong public benchmark for function calling, multi-turn, live, and agentic tool categories.

Category
Tool use
Owner
UC Berkeley Gorilla
Data path
Use the latest dated result archive after matching it to the public leaderboard. Prefer category rows first.
View source Official
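Picking the latest dated result archive can be sketched as below. This is only an illustration: the archive names and the ISO `YYYY-MM-DD` date strings are hypothetical, and the step of matching the archive against the public leaderboard is left out.

```python
from datetime import date


def latest_archive(archives: dict[str, str]) -> str:
    """Return the name of the archive with the most recent date.

    `archives` maps a (hypothetical) archive name to an ISO date string.
    Dates are parsed rather than compared as raw text, so malformed
    values raise ValueError instead of being silently mis-ordered.
    """
    if not archives:
        raise ValueError("no archives to choose from")
    return max(archives, key=lambda name: date.fromisoformat(archives[name]))
```

For example, given hypothetical archives dated `2025-01-10`, `2025-03-02`, and `2024-12-31`, the function returns the name of the `2025-03-02` entry.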
Ready now · Result archive · Structured data · Partial run guide · Public data

LiveBench

Broad public eval with frequently updated releases across reasoning, coding, math, and instruction following.

Category
Reasoning
Owner
LiveBench
Data path
Use the current release table CSV; the headline score is the global average across the six task categories.
View source Official
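The LiveBench headline score can be recomputed from the per-category values as a sanity check. A minimal sketch, assuming the global average is an unweighted mean over exactly six category scores; the function name and the dictionary layout are hypothetical and do not reflect the CSV's actual column names:

```python
from statistics import fmean


def global_average(category_scores: dict[str, float], expected_categories: int = 6) -> float:
    """Unweighted mean across task categories (assumed aggregation).

    Refuses to average a partial row, so a missing category cannot
    quietly inflate or deflate the headline score.
    """
    if len(category_scores) != expected_categories:
        raise ValueError(
            f"expected {expected_categories} categories, got {len(category_scores)}"
        )
    return fmean(category_scores.values())
```

Rejecting rows with missing categories is a design choice here: a model evaluated on only some categories should not be ranked against fully evaluated ones.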
Ready now · HF dataset · Review needed · Run guide ready · Public data

Terminal-Bench 2.1

Command-line agent benchmark with a task registry; results are sensitive to the agent adapter used.

Category
Agents
Owner
Harbor / Laude Institute
Data path
Use page and HF rows with agent name, model, and task-set version kept separate.
View source Official
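One way to keep agent name, model, and task-set version separate for Terminal-Bench rows is to make all three part of the row key. A minimal sketch; the field names (`agent`, `model`, `task_set_version`, `score`) are hypothetical and not taken from the official page or the HF dataset:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunKey:
    """Identity of one result row; all three parts stay distinct fields."""
    agent: str
    model: str
    task_set_version: str


def index_rows(rows: list[dict]) -> dict[RunKey, float]:
    """Index scores by (agent, model, task-set version).

    Raises ValueError on a duplicate key instead of overwriting, so two
    rows are never merged just because they share a model name.
    """
    indexed: dict[RunKey, float] = {}
    for row in rows:
        key = RunKey(row["agent"], row["model"], row["task_set_version"])
        if key in indexed:
            raise ValueError(f"duplicate row for {key}")
        indexed[key] = row["score"]
    return indexed
```

With this keying, the same model run under two different agents produces two entries rather than one.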
Ready now · Static HTML · Review needed · Run guide ready · Public data

SWE-bench Pro

Harder public follow-on to SWE-bench with professional software tasks.

Category
Coding
Owner
Scale AI
Data path
Use official page rows and repo result JSON together where both are available.
View source Official
Ready now · Static HTML · Review needed · Run guide ready · Public data

DeepSWE

Long-horizon software-engineering benchmark with original tasks, broad repo coverage, and behavioral verifiers.

Category
Coding
Owner
DataCurve
Data path
Start with official blog rows and task manifest, then add trial-level detail when the raw index is pinned.
View source Official
Next · Manual curated · Review needed · Run guide ready · Curated source

GPQA Diamond

Widely cited graduate-level science QA benchmark already in product scope.

Category
Reasoning
Owner
GPQA authors
Data path
Use curated source-linked rows from model-system cards and lab release tables.
View source Official
Next · Manual curated · Watchlist · Partial run guide · Page-backed data

Humanity's Last Exam

High-visibility frontier benchmark with difficult expert questions.

Category
Reasoning
Owner
Humanity's Last Exam
Data path
Use only after each score row has source verification and retrieved-at metadata.
View source Official
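The gating rule for Humanity's Last Exam rows can be sketched as a per-row check. This is an illustration only: the field names `source_verified` and `retrieved_at`, and the choice of ISO 8601 timestamps, are assumptions rather than a documented schema.

```python
from datetime import datetime


def is_publishable(row: dict) -> bool:
    """True only if the row is source-verified and has a parseable retrieved-at time.

    Assumed fields: `source_verified` (bool) and `retrieved_at`
    (ISO 8601 string such as "2026-01-15T12:00:00+00:00").
    """
    if row.get("source_verified") is not True:
        return False
    retrieved_at = row.get("retrieved_at")
    if not isinstance(retrieved_at, str):
        return False
    try:
        datetime.fromisoformat(retrieved_at)
    except ValueError:
        return False
    return True
```

Rows that fail the check are held back rather than shown with a missing-metadata warning, which matches the "use only after" wording above.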
Next · Raw JSON · Structured data · Run guide ready · Public data

MMMU-Pro

Leading multimodal reasoning benchmark; MMMU-Pro is the harder variant frontier models still report.

Category
Multimodal
Owner
MMMU Benchmark
Data path
Use the official leaderboard's MMMU-Pro overall accuracy; keep tool use / thinking as run context.
View source Official
Next · GCS bucket · Review needed · Run guide ready · Machine-readable

HELM

Reproducible benchmark output and schemas across many scenarios.

Category
Other
Owner
Stanford CRFM
Data path
Index releases and suites first, then show selected scenarios only.
View source Official
Later · HF dataset · Watchlist · No run guide · Page-backed data

LMArena

Public preference signal users care about.

Category
Chat preference
Owner
LMArena
Data path
Treat Arena score/Elo-style ratings as benchmark-specific metrics only.
View source Official
Later · Raw JSON · Structured data · Partial run guide · Public data

ARC-AGI-3

Frontier interactive reasoning/generalization benchmark; current models still score near zero.

Category
Reasoning
Owner
ARC Prize
Data path
Use the official v3.json leaderboard rows with reported cost preserved.
View source Official
Later · Raw JSON · Structured data · Partial run guide · Public data

ARC-AGI-2

Widely cited abstraction/generalization benchmark; 2026 frontier models pushed scores from ~6% to ~85%.

Category
Reasoning
Owner
ARC Prize
Data path
Use the official v2.json leaderboard rows; reasoning effort is encoded in each model label.
View source Official
Watchlist · Manual curated · Limited access · Run guide blocked · Limited public data

FrontierMath

High-signal frontier math benchmark.

Category
Reasoning
Owner
Epoch AI
Data path
Keep on the watchlist until public data and run instructions are stable.
View source Official
Next · Raw JSON · Structured data · Partial run guide · Public data

AIME (OTIS Mock)

Competition-math reasoning benchmark with a consistent, frequently updated independent leaderboard across all frontier models.

Category
Reasoning
Owner
Epoch AI
Data path
Use Epoch's per-model mean accuracy; keep reasoning effort as run context.
View source Official
Next · Raw JSON · Structured data · Partial run guide · Public data

SimpleQA Verified

Factual-accuracy / hallucination benchmark with a consistent independent leaderboard across frontier models.

Category
Other
Owner
Epoch AI
Data path
Use Epoch's per-model accuracy on SimpleQA Verified.
View source Official
Watchlist · Evaluator reference · Review needed · Run guide ready · Public data

OpenAI simple-evals

Useful evaluator reference for GPQA, SimpleQA, HealthBench, BrowseComp, MMLU, MATH, MGSM, and DROP.

Category
Other
Owner
OpenAI
Data path
Use as run-guide and evaluator reference for source-linked lab tables.
View source Official