Official eval sources | evals.report

Ready now | Raw JSON | Structured data | Run guide ready | Machine-readable

SWE-bench Verified

Canonical software-engineering agent benchmark already in product scope.

Category: Coding
Owner: SWE-bench
Data path: Official leaderboard rows and per-instance metadata can be shown with scaffold and tool context preserved.

View source Official

Next | Raw JSON | Structured data | Partial run guide | Public data

LiveCodeBench Pro

The actively updated 2026 successor to LiveCodeBench; rates frontier models on live competitive-programming contests.

Category: Coding
Owner: LiveCodeBench Pro
Data path: Use the official leaderboard Elo ratings; the metric is an Elo, not a percentage.

View source Official

Ready now | Result archive | Review needed | Run guide ready | Public data

Berkeley Function Calling Leaderboard

Strong public benchmark for function calling, multi-turn, live, and agentic tool categories.

Category: Tool use
Owner: UC Berkeley Gorilla
Data path: Use the latest dated result archive after matching it to the public leaderboard. Prefer category rows first.

View source Official

Ready now | Result archive | Structured data | Partial run guide | Public data

LiveBench

Broad public eval with frequently updated releases across reasoning, coding, math, and instruction following.

Category: Reasoning
Owner: LiveBench
Data path: Use the current release table CSV; the headline score is the global average across the six task categories.

View source Official
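
The LiveBench data path above describes the headline score as a global average across six task categories. A minimal sketch of that arithmetic follows, assuming the average is an unweighted mean of per-category scores; the category keys and values are hypothetical placeholders, not rows from the release table.

```python
from statistics import fmean


def headline_score(category_scores: dict[str, float]) -> float:
    """Return the unweighted mean of exactly six category scores.

    Assumption: the global average weights each category equally.
    """
    if len(category_scores) != 6:
        raise ValueError(f"expected 6 categories, got {len(category_scores)}")
    return fmean(category_scores.values())


# Hypothetical category names and scores, for illustration only.
example = {
    "cat_a": 60.0,
    "cat_b": 70.0,
    "cat_c": 80.0,
    "cat_d": 50.0,
    "cat_e": 40.0,
    "cat_f": 60.0,
}
print(headline_score(example))  # 60.0
```

Rejecting inputs with a missing category keeps a partial row from being shown as if it were a full headline score.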

Ready now | HF dataset | Review needed | Run guide ready | Public data

Terminal-Bench 2.1

Important command-line agent benchmark with a task registry and adapter-sensitive results.

Category: Agents
Owner: Harbor / Laude Institute
Data path: Use page and HF rows with agent name, model, and task-set version kept separate.

View source Official
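
The Terminal-Bench data path above asks for agent name, model, and task-set version to stay separate. One way to sketch that is a row type with three distinct fields plus a grouping step, so scores from different task-set versions never share a table. The class name, field names, and sample rows are hypothetical, not the schema of the official page or HF dataset.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TerminalBenchRow:
    # Hypothetical field names; the three identifiers are never merged
    # into a single label.
    agent: str
    model: str
    task_set_version: str
    score: float


def group_by_task_set(rows: list[TerminalBenchRow]) -> dict[str, list[TerminalBenchRow]]:
    """Bucket rows by task-set version so versions are compared separately."""
    groups: dict[str, list[TerminalBenchRow]] = defaultdict(list)
    for row in rows:
        groups[row.task_set_version].append(row)
    return dict(groups)


# Made-up rows, for illustration only.
rows = [
    TerminalBenchRow("agent-x", "model-a", "2.0", 41.0),
    TerminalBenchRow("agent-x", "model-a", "2.1", 38.5),
    TerminalBenchRow("agent-y", "model-a", "2.1", 44.0),
]
grouped = group_by_task_set(rows)
print(sorted(grouped))       # ['2.0', '2.1']
print(len(grouped["2.1"]))   # 2
```

Because results are adapter-sensitive, the same model appears once per agent rather than being collapsed to a single model row.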

Ready now | Static HTML | Review needed | Run guide ready | Public data

SWE-bench Pro

Harder public follow-on to SWE-bench with professional software tasks.

Category: Coding
Owner: Scale AI
Data path: Use official page rows and repo result JSON together where both are available.

View source Official

Ready now | Static HTML | Review needed | Run guide ready | Public data

DeepSWE

Long-horizon software-engineering benchmark with original tasks, broad repo coverage, and behavioral verifiers.

Category: Coding
Owner: DataCurve
Data path: Start with official blog rows and task manifest, then add trial-level detail when the raw index is pinned.

View source Official

Next | Manual curated | Review needed | Run guide ready | Curated source

GPQA Diamond

Widely cited graduate-level science QA benchmark already in product scope.

Category: Reasoning
Owner: GPQA authors
Data path: Use curated, source-linked rows from model and system cards and lab release tables.

View source Official

Next | Manual curated | Watchlist | Partial run guide | Page-backed data

Humanity's Last Exam

High-visibility frontier benchmark with difficult expert questions.

Category: Reasoning
Owner: Humanity's Last Exam
Data path: Use only after each score row has source verification and retrieved-at metadata.

View source Official
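
The Humanity's Last Exam data path above gates each score row on source verification and retrieved-at metadata. A minimal sketch of such a gate follows; the keys `source_verified` and `retrieved_at` and the sample rows are hypothetical, chosen only to mirror the wording of the data path.

```python
def is_usable(row: dict) -> bool:
    """A row is usable only if it is marked verified AND carries a
    non-empty retrieved-at value (hypothetical keys)."""
    return row.get("source_verified") is True and bool(row.get("retrieved_at"))


# Made-up rows, for illustration only.
rows = [
    {"model": "model-a", "score": 20.0, "source_verified": True, "retrieved_at": "2026-01-15"},
    {"model": "model-b", "score": 25.0, "source_verified": True},
    {"model": "model-c", "score": 30.0, "source_verified": False, "retrieved_at": "2026-01-15"},
]
usable = [row for row in rows if is_usable(row)]
print([row["model"] for row in usable])  # ['model-a']
```

Rows that fail either check are held back rather than shown with a warning, which matches the "use only after" wording.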

Next | Raw JSON | Structured data | Run guide ready | Public data

MMMU-Pro

Leading multimodal reasoning benchmark; MMMU-Pro is the harder variant frontier models still report.

Category: Multimodal
Owner: MMMU Benchmark
Data path: Use the official leaderboard's MMMU-Pro overall accuracy; keep tool use / thinking as run context.

View source Official

Next | GCS bucket | Review needed | Run guide ready | Machine-readable

HELM

Reproducible benchmark output and schemas across many scenarios.

Category: Other
Owner: Stanford CRFM
Data path: Index releases and suites first, then show selected scenarios only.

View source Official

Later | HF dataset | Watchlist | No run guide | Page-backed data

LMArena

Public preference signal users care about.

Category: Chat preference
Owner: LMArena
Data path: Treat Arena score/Elo-style ratings as benchmark-specific metrics only.

View source Official
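
The LMArena data path above treats Arena score and Elo-style ratings as benchmark-specific metrics. One way to enforce that in code is to tag every value with its benchmark and metric kind and refuse comparisons across them. This is a sketch under assumed names; `Score`, `score_gap`, and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    benchmark: str  # e.g. "LMArena"
    metric: str     # e.g. "elo" or "accuracy_pct"
    value: float


def score_gap(a: Score, b: Score) -> float:
    """Return a.value - b.value, but only within one benchmark and metric."""
    if (a.benchmark, a.metric) != (b.benchmark, b.metric):
        raise ValueError("scores from different benchmarks or metrics are not comparable")
    return a.value - b.value


# Made-up values, for illustration only.
arena_a = Score("LMArena", "elo", 1450.0)
arena_b = Score("LMArena", "elo", 1400.0)
other = Score("GPQA Diamond", "accuracy_pct", 80.0)

print(score_gap(arena_a, arena_b))  # 50.0
```

Calling `score_gap(arena_a, other)` raises `ValueError`, so an Elo-style rating cannot be subtracted from a percentage by accident.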

Later | Raw JSON | Structured data | Partial run guide | Public data

ARC-AGI-3

Frontier interactive reasoning/generalization benchmark; current models still score near zero.

Category: Reasoning
Owner: ARC Prize
Data path: Use the official v3.json leaderboard rows with reported cost preserved.

View source Official

Later | Raw JSON | Structured data | Partial run guide | Public data

ARC-AGI-2

Widely cited abstraction/generalization benchmark; 2026 frontier models raised scores from ~6% to ~85%.

Category: Reasoning
Owner: ARC Prize
Data path: Use the official v2.json leaderboard rows; reasoning effort is encoded in each model label.

View source Official

Watchlist | Manual curated | Limited access | Run guide blocked | Limited public data

FrontierMath

High-signal frontier math benchmark.

Category: Reasoning
Owner: Epoch AI
Data path: Keep on the watchlist until public data and run instructions are stable.

View source Official

Next | Raw JSON | Structured data | Partial run guide | Public data

AIME (OTIS Mock)

Competition-math reasoning benchmark with a consistent, frequently updated independent leaderboard across all frontier models.

Category: Reasoning
Owner: Epoch AI
Data path: Use Epoch's per-model mean accuracy; keep reasoning effort as run context.

View source Official

Next | Raw JSON | Structured data | Partial run guide | Public data

SimpleQA Verified

Factual-accuracy / hallucination benchmark with a consistent independent leaderboard across frontier models.

Category: Other
Owner: Epoch AI
Data path: Use Epoch's per-model accuracy on SimpleQA Verified.

View source Official

Watchlist | Evaluator reference | Review needed | Run guide ready | Public data

OpenAI simple-evals

Useful evaluator reference for GPQA, SimpleQA, HealthBench, BrowseComp, MMLU, MATH, MGSM, and DROP.

Category: Other
Owner: OpenAI
Data path: Use as run-guide and evaluator reference for source-linked lab tables.

View source Official