evals.report

Anthropic

Model lab for Claude public benchmark rows.

14 models · 59 results · anthropic.com

Models 14

Progress by benchmark

Claude 3.5 Sonnet (Jun 20, 2024): no score listed
Claude 3.7 Sonnet (Feb 24, 2025): 61.0%
Claude Sonnet 4 (May 22, 2025): no score listed
Claude Opus 4 (May 22, 2025): 70.7%
Claude Opus 4.1 (Aug 5, 2025): 73.3%
Claude Sonnet 4.5 (Sep 29, 2025): 71.3%
Claude Haiku 4.5 (Oct 1, 2025): no score listed
Claude Opus 4.5 (Nov 1, 2025): 76.7%
Claude Opus 4.6 (Feb 5, 2026): 78.7%
Claude Opus 4.6 thinking (Feb 5, 2026): no score listed
Claude Sonnet 4.6 (Feb 5, 2026): 75.2%
Claude Opus 4.7 (Apr 16, 2026): 83.5%
Claude Opus 4.7 thinking (Apr 16, 2026): no score listed
Claude Opus 4.8 (May 28, 2026): 88.6%

Single benchmark only
This view shows SWE-bench Verified (% resolved) only. Other benchmarks use different metrics and are not directly comparable.
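
As a rough illustration of how the progress in this view could be read, the sketch below takes the SWE-bench Verified scores listed above and reports each release that set a new high, with the gain in percentage points over the previous high. The names `SWE_BENCH_VERIFIED` and `running_best` are hypothetical and not part of the site; models listed without a score are left out.

```python
from datetime import date

# (model, release date, SWE-bench Verified % resolved), copied from the list above.
# Models that appear without a score in this view are omitted.
SWE_BENCH_VERIFIED = [
    ("Claude 3.7 Sonnet", date(2025, 2, 24), 61.0),
    ("Claude Opus 4", date(2025, 5, 22), 70.7),
    ("Claude Opus 4.1", date(2025, 8, 5), 73.3),
    ("Claude Sonnet 4.5", date(2025, 9, 29), 71.3),
    ("Claude Opus 4.5", date(2025, 11, 1), 76.7),
    ("Claude Opus 4.6", date(2026, 2, 5), 78.7),
    ("Claude Sonnet 4.6", date(2026, 2, 5), 75.2),
    ("Claude Opus 4.7", date(2026, 4, 16), 83.5),
    ("Claude Opus 4.8", date(2026, 5, 28), 88.6),
]


def running_best(rows):
    """Return (model, released, score, gain) for each release that sets a new high.

    gain is the improvement in percentage points over the previous high,
    or None for the first scored release.
    """
    best = None
    out = []
    # sorted() is stable, so same-day releases keep their listed order.
    for model, released, score in sorted(rows, key=lambda r: r[1]):
        if best is None or score > best:
            gain = None if best is None else round(score - best, 1)
            out.append((model, released, score, gain))
            best = score
    return out


if __name__ == "__main__":
    for model, released, score, gain in running_best(SWE_BENCH_VERIFIED):
        change = "first score" if gain is None else f"+{gain} pts"
        print(f"{released}  {model:<20} {score:5.1f}%  {change}")
```

Because the gains are differences within a single benchmark, they stay meaningful; the same arithmetic across different benchmarks would not.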

Progress matrix

Columns (benchmark, metric), in order:
Model
SWE-bench Verified (% resolved)
GPQA Diamond (accuracy)
LiveCodeBench Pro (Codeforces Elo)
Berkeley Function Calling Leaderboard (accuracy)
LiveBench (score)
Terminal-Bench 2.1 (task success)
SWE-bench Pro (% resolved)
DeepSWE (% resolved)
Humanity's Last Exam (accuracy)
MMMU-Pro (accuracy)
LMArena (source-defined rating)
ARC-AGI-3 (accuracy)
ARC-AGI-2 (accuracy)
FrontierMath (accuracy)
AIME (OTIS Mock) (accuracy)
SimpleQA Verified (accuracy)

Values for each model are listed in column order. Blank cells are omitted, so a value's position does not identify its column.

Claude 3.5 Sonnet (Claude Sonnet): no values
Claude 3.7 Sonnet (Claude Sonnet): 61.0%
Claude Sonnet 4 (Claude Sonnet): 42.70%, 5.93%
Claude Opus 4 (Claude Opus): 70.7%
Claude Opus 4.1 (Claude Opus): 73.3%
Claude Sonnet 4.5 (Claude Sonnet): 71.3%, 1412, 73.24%, 43.60%, 68.9%
Claude Haiku 4.5 (Claude Haiku): 68.70%, 39.45%, 0.22%
Claude Opus 4.5 (Claude Opus): 76.7%, 86.0%, 77.47%, 75.96%, 45.89%, 25.8%, 73.9%, 20.69%
Claude Opus 4.6 (Claude Opus): 78.7%, 90.5%, 76.33%, 27.06%, 34.2%, 77.3%, 1497, 0.51%, 69.17%, 40.7%, 94.4%, 46.5%
Claude Opus 4.6 thinking (Claude Opus): 51.90%, 1499
Claude Sonnet 4.6 (Claude Sonnet): 75.2%, 87.4%, 75.47%, 31.56%, 21.07%, 75.6%, 1454, 32.4%
Claude Opus 4.7 (Claude Opus): 83.5%, 90.2%, 76.91%, 54.20%, 39.04%, 1480, 0.18%, 75.83%, 43.79%, 97.8%, 50.6%
Claude Opus 4.7 thinking (Claude Opus): 1486
Claude Opus 4.8 (Claude Opus): 88.6%, 93.6%, 77.22%, 49.8%

Scores are not normalised across benchmarks. Each column uses its own metric. Compare columns independently.
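
One way to respect this rule in code is to carry the benchmark and metric along with every value and refuse to compare across columns. The sketch below is only an illustration: the `Score` type and `difference` function are hypothetical names, not part of the site, and the two example values are the SWE-bench Verified results shown on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    model: str
    benchmark: str
    metric: str
    value: float


def difference(a: Score, b: Score) -> float:
    """Return a.value - b.value, but only within a single benchmark column."""
    if (a.benchmark, a.metric) != (b.benchmark, b.metric):
        raise ValueError(
            f"cannot compare {a.benchmark} ({a.metric}) "
            f"with {b.benchmark} ({b.metric})"
        )
    return round(a.value - b.value, 2)


# Both values come from the SWE-bench Verified column on this page.
opus_47 = Score("Claude Opus 4.7", "SWE-bench Verified", "% resolved", 83.5)
opus_48 = Score("Claude Opus 4.8", "SWE-bench Verified", "% resolved", 88.6)

if __name__ == "__main__":
    print(difference(opus_48, opus_47))
```

Passing two scores from different columns, for example a % resolved figure and an Elo-style rating, raises `ValueError` instead of returning a meaningless number.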