evals.report

OpenAI

Model lab for OpenAI public benchmark rows.

19 models · 79 results · openai.com

Models 19

Progress by benchmark

Model | Released | SWE-bench Verified (% resolved)
GPT-4o | May 13, 2024 | no score
GPT-4.1 | Apr 14, 2025 | no score
o3 | Apr 16, 2025 | 62.3%
o4-mini (high) | Apr 16, 2025 | no score
o4-mini | Apr 16, 2025 | no score
GPT-OSS-120B | Aug 5, 2025 | no score
GPT-5 | Aug 7, 2025 | no score
GPT-5 high | Aug 7, 2025 | 73.6%
GPT-5 mini | Aug 7, 2025 | 64.7%
GPT-5.1 | Nov 13, 2025 | 68.0%
GPT-5.2 | Dec 11, 2025 | 73.8%
GPT-5.2-Codex | Dec 11, 2025 | no score
GPT-5.3-Codex | Feb 19, 2026 | 74.8%
GPT-5.4 | Mar 5, 2026 | 76.9%
GPT-5.4 xHigh | Mar 5, 2026 | no score
GPT-5.4 Pro | Mar 5, 2026 | no score
GPT-5.5 | Apr 23, 2026 | 80.6%
GPT-5.5 high | Apr 23, 2026 | no score
GPT-5.5 Pro | Apr 23, 2026 | no score

Single benchmark only
This view shows SWE-bench Verified (% resolved) only. Other benchmarks use different metrics and are not directly comparable.
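
One way to read progress from this view is to keep only the releases that set a new high on SWE-bench Verified. The sketch below is illustrative, not part of the site: the scores and dates are copied from the chart above, while the names `SWE_BENCH_VERIFIED` and `running_best` are hypothetical. Models without a score in this view are left out, and ties on release date are resolved in the order the chart lists them.

```python
from datetime import date

# (model, release date, SWE-bench Verified % resolved), copied from the chart.
# Models that show no score in this view are omitted.
SWE_BENCH_VERIFIED = [
    ("o3", date(2025, 4, 16), 62.3),
    ("GPT-5 high", date(2025, 8, 7), 73.6),
    ("GPT-5 mini", date(2025, 8, 7), 64.7),
    ("GPT-5.1", date(2025, 11, 13), 68.0),
    ("GPT-5.2", date(2025, 12, 11), 73.8),
    ("GPT-5.3-Codex", date(2026, 2, 19), 74.8),
    ("GPT-5.4", date(2026, 3, 5), 76.9),
    ("GPT-5.5", date(2026, 4, 23), 80.6),
]


def running_best(rows):
    """Return the rows that set a new high score, in release order.

    sorted() is stable, so models released on the same day keep the
    order they have in the input list.
    """
    best = float("-inf")
    records = []
    for model, released, score in sorted(rows, key=lambda row: row[1]):
        if score > best:
            best = score
            records.append((model, released, score))
    return records


if __name__ == "__main__":
    for model, released, score in running_best(SWE_BENCH_VERIFIED):
        print(f"{released.isoformat()}  {model:<14} {score:.1f}%")
```

The same function should not be fed rows from other benchmarks mixed together, since each benchmark uses its own metric.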

Progress matrix

Columns, in order (benchmark | metric):
SWE-bench Verified | % resolved
GPQA Diamond | accuracy
LiveCodeBench Pro | Codeforces Elo
Berkeley Function Calling Leaderboard | accuracy
LiveBench | score
Terminal-Bench 2.1 | task success
SWE-bench Pro | % resolved
DeepSWE | % resolved
Humanity's Last Exam | accuracy
MMMU-Pro | accuracy
LMArena | source-defined rating
ARC-AGI-3 | accuracy
ARC-AGI-2 | accuracy
FrontierMath | accuracy
AIME (OTIS Mock) | accuracy
SimpleQA Verified | accuracy

Model | Family | Reported values, in source order (empty cells are not shown)
GPT-4o | GPT | none
GPT-4.1 | GPT | 53.96%
o3 | o-series | 62.3%, 63.05%, 76.4%, 6.53%, 18.69%, 53.0%
o4-mini (high) | o-series | 2092, 6.11%
o4-mini | o-series | 53.24%, 24.83%
GPT-OSS-120B | GPT OSS | 1299, 16.20%, 88.9%
GPT-5 | GPT | 32.41%
GPT-5 high | GPT | 73.6%, 86.2%, 2176, 41.78%, 25.32%, 78.4%, 91.4%, 50.6%
GPT-5 mini | GPT | 64.7%, 55.46%
GPT-5.1 | GPT | 68.0%, 87.6%, 2269, 27.2%, 79.0%, 31.03%, 88.6%, 48.9%
GPT-5.2 | GPT | 73.8%, 91.4%, 2393, 55.87%, 74.84%, 29.94%, 29.9%, 80.4%, 40.7%, 96.1%
GPT-5.2-Codex | GPT | 41.04%
GPT-5.3-Codex | GPT | 74.8%
GPT-5.4 | GPT | 76.9%, 55.53%, 40.28%, 82.1%, 1472, 0.21%, 47.6%
GPT-5.4 xHigh | GPT | 93.3%, 80.28%, 59.10%, 73.95%, 95.3%, 44.8%
GPT-5.4 Pro | GPT | 94.6%, 83.33%, 50.0%, 47.8%
GPT-5.5 | GPT | 80.6%, 94.0%, 80.71%, 70.05%, 43.56%, 1463, 85%, 51.7%, 100.0%, 63.1%
GPT-5.5 high | GPT | 1468, 0.43%
GPT-5.5 Pro | GPT | 93.9%, 84.58%, 52.4%, 100.0%, 64.5%

Scores are not normalised across benchmarks. Each column uses its own metric. Compare columns independently.