evals.report
01
SWE-bench Verified · Coding / % resolved / ↑ higher

A curated SWE-bench split for evaluating systems that resolve real software engineering issues.

github.com/SWE-bench/SWE-bench · 24 reported · guide available
88.6%
Top reported · Claude Opus 4.8
02
GPQA Diamond · Reasoning / accuracy / ↑ higher

A difficult subset of GPQA for graduate-level science question answering evaluation.

github.com/idavidrein/gpqa · 24 reported · guide available
94.6%
Top reported · GPT-5.4 Pro
03
LiveCodeBench Pro · Coding / Codeforces Elo / ↑ higher

A live competitive-programming benchmark that rates LLMs with a Codeforces-style Elo on fresh contest problems.

github.com/GavinZhengOI/LiveCodeBench-Pro · 14 reported · guide available
3298
Top reported · Gemini 3 Deep Think
04
Berkeley Function Calling Leaderboard · Tool use / accuracy / ↑ higher

A function-calling and tool-use benchmark covering single-turn, multi-turn, live, and agentic scenarios.

github.com/ShishirPatil/gorilla/tree/main/berkeley-function-call-leaderboard · 19 reported · guide available
77.47%
Top reported · Claude Opus 4.5
05
LiveBench · Reasoning / score / ↑ higher

A frequently updated public benchmark suite spanning reasoning, coding, math, language, and instruction-following tasks.

github.com/LiveBench/LiveBench · 16 reported · guide available
80.71%
Top reported · GPT-5.5
06
Terminal-Bench 2.1 · Agents / task success / ↑ higher

A command-line agent benchmark for completing terminal tasks in reproducible task environments.

github.com/laude-institute/terminal-bench · 11 reported · guide available
83.4%
Top reported · Codex CLI + GPT-5.5
07
SWE-bench Pro · Coding / % resolved / ↑ higher

A harder public software-engineering agent benchmark built around professional repository tasks.

github.com/scaleapi/SWE-bench_Pro-os · 22 reported · guide available
59.10%
Top reported · GPT-5.4 xHigh
08
DeepSWE · Coding / % resolved / ↑ higher

A long-horizon software-engineering benchmark with original tasks, broad repository coverage, and behavioral verifiers.

github.com/datacurve-ai/deep-swe · 12 reported · guide available
70.05%
Top reported · GPT-5.5
09
Humanity's Last Exam · Reasoning / accuracy / ↑ higher

A broad expert-level academic question-answering benchmark for frontier reasoning systems.

21 reported · guide available
49.8%
Top reported · Claude Opus 4.8
10
MMMU-Pro · Multimodal / accuracy / ↑ higher

MMMU-Pro, the harder variant of the MMMU multimodal reasoning benchmark (college-level subject tasks with text and images) and the variant that current frontier models report.

github.com/MMMU-Benchmark/MMMU · 13 reported · guide available
82.1%
Top reported · GPT-5.4
11
LMArena · Chat preference / source-defined rating / ↑ higher

A public chat-preference evaluation surface with source-defined preference ratings and model comparisons.

20 reported
1499
Top reported · Claude Opus 4.6 thinking
12
ARC-AGI-3 · Reasoning / accuracy / ↑ higher

The interactive ARC-AGI-3 generalization benchmark: agents must learn novel game environments from scratch (semi-private set).

6 reported · guide available
0.51%
Top reported · Claude Opus 4.6
13
ARC-AGI-2 · Reasoning / accuracy / ↑ higher

The ARC-AGI-2 abstract-reasoning puzzle benchmark (semi-private set), the harder static successor to ARC-AGI-1.

12 reported · guide available
85%
Top reported · GPT-5.5
14
FrontierMath · Reasoning / accuracy / ↑ higher

A frontier math benchmark with constrained public access and source-linked result claims.

22 reported
52.4%
Top reported · GPT-5.5 Pro
15
AIME (OTIS Mock) · Reasoning / accuracy / ↑ higher

Competition mathematics in the AIME format (Epoch AI's OTIS Mock AIME 2024-2025 set), a high-signal short-answer math reasoning benchmark.

20 reported · guide available
100.0%
Top reported · GPT-5.5 Pro
16
SimpleQA Verified · Other / accuracy / ↑ higher

A factual short-answer QA benchmark measuring parametric knowledge and hallucination resistance (Epoch AI's SimpleQA Verified).

20 reported · guide available
77.3%
Top reported · Gemini 3.1 Pro Preview