Ready now · Raw JSON · Structured data · Run guide ready · Machine-readable
SWE-bench Verified
Canonical software-engineering agent benchmark already in product scope.
- Category: Coding
- Owner: SWE-bench
- Data path: Official leaderboard rows and per-instance metadata can be shown with scaffold and tool context preserved.
Next · Raw JSON · Structured data · Partial run guide · Public data
LiveCodeBench Pro
The actively updated 2026 successor to LiveCodeBench; rates frontier models on live competitive-programming contests.
- Category: Coding
- Owner: LiveCodeBench Pro
- Data path: Use the official leaderboard Elo ratings; the metric is an Elo rating, not a percentage.
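
Because the headline number is an Elo rating, it only carries meaning relative to other ratings on the same scale. A minimal sketch of the textbook Elo expected-score formula illustrates why it cannot be read as a percentage. The function name is hypothetical, and the leaderboard's own rating procedure may differ from textbook Elo.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the textbook Elo model (400-point scale)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


# Equal ratings give an even expectation; a 400-point lead gives roughly 10:1 odds.
even = elo_expected_score(1500, 1500)
lead = elo_expected_score(1900, 1500)
```

The rating difference, not the absolute value, drives the expectation, so Elo rows should not be averaged or charted alongside accuracy percentages.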
Ready now · Result archive · Review needed · Run guide ready · Public data
Berkeley Function Calling Leaderboard
Strong public benchmark for function calling, multi-turn, live, and agentic tool categories.
- Category: Tool use
- Owner: UC Berkeley Gorilla
- Data path: Use the latest dated result archive after matching it to the public leaderboard. Prefer category rows first.
Ready now · Result archive · Structured data · Partial run guide · Public data
LiveBench
Broad public eval with frequently updated releases across reasoning, coding, math, and instruction following.
- Category: Reasoning
- Owner: LiveBench
- Data path: Use the current release table CSV; the headline score is the global average across the six task categories.
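
The headline arithmetic can be sketched as an unweighted mean over six category columns. This is an illustration under stated assumptions: the column names and scores below are hypothetical, and the real release CSV may use different headers or already include the average.

```python
import csv
import io
from statistics import mean

# Hypothetical column names and made-up scores, for illustration only.
SAMPLE_CSV = """model,reasoning,coding,math,data_analysis,language,instruction_following
model-a,60.0,70.0,50.0,40.0,30.0,80.0
model-b,50.0,50.0,50.0,50.0,50.0,50.0
"""

CATEGORY_COLUMNS = [
    "reasoning",
    "coding",
    "math",
    "data_analysis",
    "language",
    "instruction_following",
]


def global_average(row: dict) -> float:
    """Unweighted mean of the six category scores in one CSV row."""
    return mean(float(row[column]) for column in CATEGORY_COLUMNS)


def load_global_averages(csv_text: str) -> dict:
    """Map each model name to its global average."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["model"]: global_average(row) for row in reader}


averages = load_global_averages(SAMPLE_CSV)
```

Recomputing the average from the category columns is a cheap way to check that an ingested row matches the published headline score.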
Ready now · HF dataset · Review needed · Run guide ready · Public data
Terminal-Bench 2.1
Important command-line agent benchmark with a task registry and adapter-sensitive results.
- Category: Agents
- Owner: Harbor / Laude Institute
- Data path: Use page rows and HF dataset rows, keeping agent name, model, and task-set version as separate fields.
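
One way to keep agent, model, and task-set version from collapsing into a single label is to make them separate fields of the row key. The sketch below assumes hypothetical class names, field names, and values; it is not the schema of the official page or dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunKey:
    """Identity of a result row; all three parts stay separate (hypothetical schema)."""
    agent: str
    model: str
    task_set_version: str


def index_scores(rows: list) -> dict:
    """Index (agent, model, task_set_version, score) tuples by their full key."""
    indexed = {}
    for agent, model, version, score in rows:
        indexed[RunKey(agent, model, version)] = score
    return indexed


# Made-up rows: same model, different agents, so they must not be merged.
scores = index_scores([
    ("agent-x", "model-a", "2.1", 41.0),
    ("agent-y", "model-a", "2.1", 47.5),
    ("agent-x", "model-a", "2.0", 52.0),
])
```

Keying on all three fields means a score from one adapter or task-set version can never overwrite a score from another.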
Ready now · Static HTML · Review needed · Run guide ready · Public data
SWE-bench Pro
Harder public follow-on to SWE-bench with professional software tasks.
- Category: Coding
- Owner: Scale AI
- Data path: Use official page rows and repo result JSON together where both are available.
Ready now · Static HTML · Review needed · Run guide ready · Public data
DeepSWE
Long-horizon software-engineering benchmark with original tasks, broad repo coverage, and behavioral verifiers.
- Category: Coding
- Owner: DataCurve
- Data path: Start with official blog rows and task manifest, then add trial-level detail when the raw index is pinned.
Next · Manual curated · Review needed · Run guide ready · Curated source
GPQA Diamond
Widely cited graduate-level science QA benchmark already in product scope.
- Category: Reasoning
- Owner: GPQA authors
- Data path: Use curated source-linked rows from model-system cards and lab release tables.
Next · Manual curated · Watchlist · Partial run guide · Page-backed data
Humanity's Last Exam
High-visibility frontier benchmark with difficult expert questions.
- Category: Reasoning
- Owner: Humanity's Last Exam
- Data path: Use only after each score row has source verification and retrieved-at metadata.
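
The gate described above can be sketched as a per-row check that runs before a score is shown. The field names (`source_url`, `source_verified`, `retrieved_at`) and the example URL are hypothetical, chosen only to illustrate the rule.

```python
from datetime import datetime


def is_publishable(row: dict) -> bool:
    """Return True only if the row has a verified source and a parseable retrieved-at timestamp."""
    if not row.get("source_url"):
        return False
    if row.get("source_verified") is not True:
        return False
    retrieved_at = row.get("retrieved_at")
    if not retrieved_at:
        return False
    try:
        # Expect an ISO 8601 timestamp such as "2026-01-15T12:00:00+00:00".
        datetime.fromisoformat(retrieved_at)
    except (TypeError, ValueError):
        return False
    return True


# Made-up rows for illustration.
complete_row = {
    "model": "model-a",
    "score": 12.3,
    "source_url": "https://example.com/scores",
    "source_verified": True,
    "retrieved_at": "2026-01-15T12:00:00+00:00",
}
unverified_row = {**complete_row, "source_verified": False}
undated_row = {key: value for key, value in complete_row.items() if key != "retrieved_at"}
```

Rows that fail the check stay out of the displayed table rather than appearing with a missing-source caveat.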
Next · Raw JSON · Structured data · Run guide ready · Public data
MMMU-Pro
Leading multimodal reasoning benchmark; MMMU-Pro is the harder variant frontier models still report.
- Category: Multimodal
- Owner: MMMU Benchmark
- Data path: Use the official leaderboard's MMMU-Pro overall accuracy; keep tool use / thinking as run context.
Next · GCS bucket · Review needed · Run guide ready · Machine-readable
HELM
Reproducible benchmark output and schemas across many scenarios.
- Category: Other
- Owner: Stanford CRFM
- Data path: Index releases and suites first, then show selected scenarios only.
Later · HF dataset · Watchlist · No run guide · Page-backed data
LMArena
Public preference signal users care about.
- Category: Chat preference
- Owner: LMArena
- Data path: Treat Arena score/Elo-style ratings as benchmark-specific metrics only.
Later · Raw JSON · Structured data · Partial run guide · Public data
ARC-AGI-3
Frontier interactive reasoning/generalization benchmark; current models still score near zero.
- Category: Reasoning
- Owner: ARC Prize
- Data path: Use the official v3.json leaderboard rows with reported cost preserved.
Later · Raw JSON · Structured data · Partial run guide · Public data
ARC-AGI-2
Widely cited abstraction/generalization benchmark; scores rose from ~6% to ~85% with 2026 frontier models.
- Category: Reasoning
- Owner: ARC Prize
- Data path: Use the official v2.json leaderboard rows; reasoning effort is encoded in each model label.
Watchlist · Manual curated · Limited access · Run guide blocked · Limited public data
FrontierMath
High-signal frontier math benchmark.
- Category: Reasoning
- Owner: Epoch AI
- Data path: Keep on the watchlist until public data and run instructions are stable.
Next · Raw JSON · Structured data · Partial run guide · Public data
AIME (OTIS Mock)
Competition-math reasoning benchmark with a consistent, frequently updated independent leaderboard across all frontier models.
- Category: Reasoning
- Owner: Epoch AI
- Data path: Use Epoch's per-model mean accuracy; keep reasoning effort as run context.
Next · Raw JSON · Structured data · Partial run guide · Public data
SimpleQA Verified
Factual-accuracy / hallucination benchmark with a consistent independent leaderboard across frontier models.
- Category: Other
- Owner: Epoch AI
- Data path: Use Epoch's per-model accuracy on SimpleQA Verified.
Watchlist · Evaluator reference · Review needed · Run guide ready · Public data
OpenAI simple-evals
Useful evaluator reference for GPQA, SimpleQA, HealthBench, BrowseComp, MMLU, MATH, MGSM, and DROP.
- Category: Other
- Owner: OpenAI
- Data path: Use as run-guide and evaluator reference for source-linked lab tables.