evals.report
evals.report · v0.1 · Reference

Official LLM eval scores, model progress, and benchmark run guides.

Scores stay attached to their benchmark, source, and run context. No composite rankings. No “best model.”
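As a rough sketch of what "attached to benchmark, source, and run context" could look like in data, the record below carries all three alongside the value. The class, field names, and helper are hypothetical and are not the site's published schema; the second row is made up for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ScoreRecord:
    # Field names are hypothetical, not the site's published schema.
    benchmark: str
    metric: str
    model: str
    value: float
    source: str
    run_context: str  # scaffold, tools, or settings used for the run


def top_reported(records: Iterable[ScoreRecord], benchmark: str) -> Optional[ScoreRecord]:
    """Highest value within one benchmark; never compares across benchmarks."""
    rows = [r for r in records if r.benchmark == benchmark]
    return max(rows, key=lambda r: r.value, default=None)


records = [
    # Value and model taken from the index on this page; run context is not stated there.
    ScoreRecord("SWE-bench Verified", "% resolved", "Claude Opus 4.8", 88.6,
                "github.com/SWE-bench/SWE-bench", "not stated"),
    # Made-up row for illustration only.
    ScoreRecord("SWE-bench Verified", "% resolved", "example-model", 50.0,
                "github.com/SWE-bench/SWE-bench", "not stated"),
]

best = top_reported(records, "SWE-bench Verified")
print(best.model, best.value)
```

Filtering by benchmark before taking the maximum keeps every "top reported" figure scoped to a single benchmark, so no cross-benchmark composite is ever formed.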

16 benchmarks · 84 models · 14 labs · Last sync 2026-06-01

Index 04

Benchmarks 06

01
SWE-bench Verified
Coding / % resolved / ↑ higher

A curated SWE-bench split for evaluating systems that resolve real software engineering issues.

github.com/SWE-bench/SWE-bench
24 reported · guide available
88.6%
Top reported · Claude Opus 4.8
02
GPQA Diamond
Reasoning / accuracy / ↑ higher

A difficult subset of GPQA for graduate-level science question answering evaluation.

github.com/idavidrein/gpqa
24 reported · guide available
94.6%
Top reported · GPT-5.4 Pro
03
LiveCodeBench Pro
Coding / Codeforces Elo / ↑ higher

A live competitive-programming benchmark that rates LLMs with a Codeforces-style Elo on fresh contest problems.

github.com/GavinZhengOI/LiveCodeBench-Pro
14 reported · guide available
3298
Top reported · Gemini 3 Deep Think
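For readers unfamiliar with Elo-style ratings, the standard logistic expected-score formula is sketched below. This is the textbook Elo form, shown as an assumption; the exact rating procedure LiveCodeBench Pro uses is not described on this page, and the function name is hypothetical.

```python
def expected_score(rating: float, opponent: float) -> float:
    """Textbook Elo expected score of `rating` against `opponent` (400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400.0))


# Equal ratings give an even expectation; a 400-point lead gives 10/11.
print(expected_score(1500, 1500))
print(expected_score(1900, 1500))
```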
04
Berkeley Function Calling Leaderboard
Tool use / accuracy / ↑ higher

A function-calling and tool-use benchmark covering single-turn, multi-turn, live, and agentic scenarios.

github.com/ShishirPatil/gorilla/tree/main/berkeley-function-call-leaderboard
19 reported · guide available
77.47%
Top reported · Claude Opus 4.5
05
LiveBench
Reasoning / score / ↑ higher

A frequently updated public benchmark suite spanning reasoning, coding, math, language, and instruction-following tasks.

github.com/LiveBench/LiveBench
16 reported · guide available
80.71%
Top reported · GPT-5.5
06
Terminal-Bench 2.1
Agents / task success / ↑ higher

A command-line agent benchmark for completing terminal tasks in reproducible task environments.

github.com/laude-institute/terminal-bench
11 reported · guide available
83.4%
Top reported · Codex CLI + GPT-5.5

Source coverage

Ready now · Raw JSON · Structured data · Run guide ready · Machine-readable

SWE-bench Verified

Canonical software-engineering agent benchmark already in product scope.

Category: Coding
Owner: SWE-bench
Data path: Official leaderboard rows and per-instance metadata can be shown with scaffold and tool context preserved.
View source Official
Ready now · Result archive · Review needed · Run guide ready · Public data

Berkeley Function Calling Leaderboard

Strong public benchmark for function calling, multi-turn, live, and agentic tool categories.

Category: Tool use
Owner: UC Berkeley Gorilla
Data path: Use the latest dated result archive after matching it to the public leaderboard. Prefer category rows first.
View source Official
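The data path above (take the latest dated archive, but only after it matches the public leaderboard) can be sketched as follows. The archive layout, row names, tolerance, and all scores are made up for illustration; the real archive format is not described on this page.

```python
from datetime import date
from typing import Dict, Optional

Scores = Dict[str, float]  # hypothetical: category row name -> score


def latest_verified_archive(archives: Dict[date, Scores],
                            leaderboard: Scores,
                            tolerance: float = 0.01) -> Optional[date]:
    """Return the newest archive date if its rows match the leaderboard, else None."""
    if not archives:
        return None
    latest = max(archives)  # dates compare chronologically
    rows = archives[latest]
    same_rows = rows.keys() == leaderboard.keys()
    if same_rows and all(abs(rows[k] - leaderboard[k]) <= tolerance for k in leaderboard):
        return latest
    return None


# Made-up archives and leaderboard rows for illustration only.
archives = {
    date(2025, 1, 1): {"overall": 70.0, "multi_turn": 55.0},
    date(2025, 6, 1): {"overall": 72.5, "multi_turn": 58.0},
}
leaderboard = {"overall": 72.5, "multi_turn": 58.0}
print(latest_verified_archive(archives, leaderboard))
```

Returning None on a mismatch, rather than falling back to an older archive, keeps an unverified snapshot from being shown as current.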
Ready now · Result archive · Structured data · Partial run guide · Public data

LiveBench

Broad public eval with frequently updated releases across reasoning, coding, math, and instruction following.

Category: Reasoning
Owner: LiveBench
Data path: Use the current release table CSV; the headline score is the global average across the six task categories.
View source Official
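A minimal sketch of the headline computation, assuming the global average is an unweighted mean of the per-category scores. The function name, the placeholder name for the sixth category, and all values are made up for illustration; check the release table for the actual category columns.

```python
from statistics import fmean
from typing import Dict


def headline_score(category_scores: Dict[str, float]) -> float:
    """Unweighted mean over category scores (assumed; weighting is not stated here)."""
    if not category_scores:
        raise ValueError("no category scores given")
    return fmean(category_scores.values())


# Made-up scores. Five names follow the categories listed on this page;
# "category_6" is a placeholder because the sixth is not named here.
example = {
    "reasoning": 80.0,
    "coding": 70.0,
    "math": 90.0,
    "language": 60.0,
    "instruction_following": 75.0,
    "category_6": 75.0,
}
print(headline_score(example))
```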

Compare

Benchmark · GPT-5 high (OpenAI) · Claude Sonnet 4.5 (Anthropic) · Gemini 2.5 Pro (Google DeepMind) · DeepSeek R1 (DeepSeek)
SWE-bench Verified · % resolved · 73.6% · 71.3%
GPQA Diamond · accuracy · 86.2% · 85.3%

Run guides 04
