OpenAI
Models (19)
| Model | Family | Identifier | Date | Results |
|---|---|---|---|---|
| GPT-4o | GPT | `gpt-4o` | 2024-05-13 | 0 |
| GPT-4.1 | GPT | `gpt-4.1` | 2025-04-14 | 1 |
| o3 | o-series | `o3` | 2025-04-16 | 6 |
| o4-mini (high) | o-series | `o4-mini` | 2025-04-16 | 2 |
| o4-mini | o-series | `o4-mini` | 2025-04-16 | 2 |
| GPT-OSS-120B | GPT OSS | `gpt-oss-120b` | 2025-08-05 | 3 |
| GPT-5 | GPT | `gpt-5` | 2025-08-07 | 1 |
| GPT-5 high | GPT | `gpt-5-high` | 2025-08-07 | 8 |
| GPT-5 mini | GPT | `gpt-5-mini` | 2025-08-07 | 2 |
| GPT-5.1 | GPT | `gpt-5.1` | 2025-11-13 | 8 |
| GPT-5.2 | GPT | `gpt-5.2` | 2025-12-11 | 10 |
| GPT-5.2-Codex | GPT | `gpt-5.2-codex` | 2025-12-11 | 1 |
| GPT-5.3-Codex | GPT | `gpt-5.3-codex` | 2026-02-19 | 1 |
| GPT-5.4 | GPT | `gpt-5.4` | 2026-03-05 | 7 |
| GPT-5.4 xHigh | GPT | `gpt-5.4 xhigh` | 2026-03-05 | 6 |
| GPT-5.4 Pro | GPT | `gpt-5.4 pro` | 2026-03-05 | 4 |
| GPT-5.5 | GPT | `gpt-5.5` | 2026-04-23 | 10 |
| GPT-5.5 high | GPT | `gpt-5.5 high` | 2026-04-23 | 2 |
| GPT-5.5 Pro | GPT | `gpt-5.5 pro` | 2026-04-23 | 5 |
Progress by benchmark
| Model | Date | SWE-bench Verified (% resolved) |
|---|---|---|
| GPT-4o | May 13, 2024 | — |
| GPT-4.1 | Apr 14, 2025 | — |
| o3 | Apr 16, 2025 | 62.3% |
| o4-mini (high) | Apr 16, 2025 | — |
| o4-mini | Apr 16, 2025 | — |
| GPT-OSS-120B | Aug 5, 2025 | — |
| GPT-5 | Aug 7, 2025 | — |
| GPT-5 high | Aug 7, 2025 | 73.6% |
| GPT-5 mini | Aug 7, 2025 | 64.7% |
| GPT-5.1 | Nov 13, 2025 | 68.0% |
| GPT-5.2 | Dec 11, 2025 | 73.8% |
| GPT-5.2-Codex | Dec 11, 2025 | — |
| GPT-5.3-Codex | Feb 19, 2026 | 74.8% |
| GPT-5.4 | Mar 5, 2026 | 76.9% |
| GPT-5.4 xHigh | Mar 5, 2026 | — |
| GPT-5.4 Pro | Mar 5, 2026 | — |
| GPT-5.5 | Apr 23, 2026 | 80.6% |
| GPT-5.5 high | Apr 23, 2026 | — |
| GPT-5.5 Pro | Apr 23, 2026 | — |
Single benchmark only
This view shows SWE-bench Verified (% resolved) only. Other benchmarks use different metrics and are not directly comparable.
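As a rough illustration of reading progress on this single benchmark, the sketch below tracks the best SWE-bench Verified score seen so far in date order. The scores and dates are copied from this view, and models without a listed score are left out. The `running_best` helper and the tuple layout are hypothetical choices made for this example, not something the page provides.

```python
from datetime import date

# (model, date, SWE-bench Verified % resolved), copied from the view above.
# Models with no listed score are omitted.
results = [
    ("o3", date(2025, 4, 16), 62.3),
    ("GPT-5 high", date(2025, 8, 7), 73.6),
    ("GPT-5 mini", date(2025, 8, 7), 64.7),
    ("GPT-5.1", date(2025, 11, 13), 68.0),
    ("GPT-5.2", date(2025, 12, 11), 73.8),
    ("GPT-5.3-Codex", date(2026, 2, 19), 74.8),
    ("GPT-5.4", date(2026, 3, 5), 76.9),
    ("GPT-5.5", date(2026, 4, 23), 80.6),
]


def running_best(rows):
    """Return (model, date, score, best_so_far) tuples in date order."""
    best = None
    out = []
    # sorted() is stable, so models sharing a date keep their listed order.
    for model, day, score in sorted(rows, key=lambda r: r[1]):
        if best is None or score > best:
            best = score
        out.append((model, day, score, best))
    return out


frontier = running_best(results)
for model, day, score, best in frontier:
    print(f"{day.isoformat()}  {model:<14} {score:5.1f}  best so far {best:5.1f}")
```

A newer model does not always raise the best score: GPT-5 mini and GPT-5.1 both score below GPT-5 high, so the running best stays at 73.6 until GPT-5.2.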
Progress matrix
| Model | Family | SWE-bench Verified (% resolved) | GPQA Diamond (accuracy) | LiveCodeBench Pro (Codeforces Elo) | Berkeley Function Calling Leaderboard (accuracy) | LiveBench (score) | Terminal-Bench 2.1 (task success) | SWE-bench Pro (% resolved) | DeepSWE (% resolved) | Humanity's Last Exam (accuracy) | MMMU-Pro (accuracy) | LMArena (source-defined rating) | ARC-AGI-3 (accuracy) | ARC-AGI-2 (accuracy) | FrontierMath (accuracy) | AIME (OTIS Mock) (accuracy) | SimpleQA Verified (accuracy) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | GPT | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| GPT-4.1 | GPT | — | — | — | 53.96% | — | — | — | — | — | — | — | — | — | — | — | — |
| o3 | o-series | 62.3% | — | — | 63.05% | — | — | — | — | — | 76.4% | — | — | 6.53% | 18.69% | — | 53.0% |
| o4-mini (high) | o-series | — | — | 2092 | — | — | — | — | — | — | — | — | — | 6.11% | — | — | — |
| o4-mini | o-series | — | — | — | 53.24% | — | — | — | — | — | — | — | — | — | 24.83% | — | — |
| GPT-OSS-120B | GPT OSS | — | — | 1299 | — | — | — | 16.20% | — | — | — | — | — | — | — | 88.9% | — |
| GPT-5 | GPT | — | — | — | — | — | — | — | — | — | — | — | — | — | 32.41% | — | — |
| GPT-5 high | GPT | 73.6% | 86.2% | 2176 | — | — | — | 41.78% | — | 25.32% | 78.4% | — | — | — | — | 91.4% | 50.6% |
| GPT-5 mini | GPT | 64.7% | — | — | 55.46% | — | — | — | — | — | — | — | — | — | — | — | — |
| GPT-5.1 | GPT | 68.0% | 87.6% | 2269 | — | — | — | — | — | 27.2% | 79.0% | — | — | — | 31.03% | 88.6% | 48.9% |
| GPT-5.2 | GPT | 73.8% | 91.4% | 2393 | 55.87% | 74.84% | — | 29.94% | — | 29.9% | 80.4% | — | — | — | 40.7% | 96.1% | — |
| GPT-5.2-Codex | GPT | — | — | — | — | — | — | 41.04% | — | — | — | — | — | — | — | — | — |
| GPT-5.3-Codex | GPT | 74.8% | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| GPT-5.4 | GPT | 76.9% | — | — | — | — | — | — | 55.53% | 40.28% | 82.1% | 1472 | 0.21% | — | 47.6% | — | — |
| GPT-5.4 xHigh | GPT | — | 93.3% | — | — | 80.28% | — | 59.10% | — | — | — | — | — | 73.95% | — | 95.3% | 44.8% |
| GPT-5.4 Pro | GPT | — | 94.6% | — | — | — | — | — | — | — | — | — | — | 83.33% | 50.0% | — | 47.8% |
| GPT-5.5 | GPT | 80.6% | 94.0% | — | — | 80.71% | — | — | 70.05% | 43.56% | — | 1463 | — | 85% | 51.7% | 100.0% | 63.1% |
| GPT-5.5 high | GPT | — | — | — | — | — | — | — | — | — | — | 1468 | 0.43% | — | — | — | — |
| GPT-5.5 Pro | GPT | — | 93.9% | — | — | — | — | — | — | — | — | — | — | 84.58% | 52.4% | 100.0% | 64.5% |
Scores are not normalised across benchmarks. Each column uses its own metric. Compare columns independently.
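One way to follow the "compare columns independently" guidance is sketched below. It parses a few cells from the matrix, treats the dash placeholder as a missing result, and finds the top scorer within each column without ever combining values across columns. Only three columns and five rows are copied here for brevity. The `parse_cell` and `column_leaders` helpers are hypothetical names for this example, and the sketch assumes that a higher value is better in all three chosen columns.

```python
from typing import Optional

MISSING = "\u2014"  # the dash the matrix uses for "no result"


def parse_cell(cell: str) -> Optional[float]:
    """Convert a matrix cell such as '73.6%' or '2176' to a float, or None if missing."""
    text = cell.strip()
    if text == MISSING:
        return None
    return float(text.rstrip("%"))


# A small subset of the matrix, copied verbatim from the table above.
columns = ["SWE-bench Verified", "GPQA Diamond", "LiveCodeBench Pro"]
rows = {
    "GPT-5 high": ["73.6%", "86.2%", "2176"],
    "GPT-5.1": ["68.0%", "87.6%", "2269"],
    "GPT-5.2": ["73.8%", "91.4%", "2393"],
    "GPT-5.4 Pro": [MISSING, "94.6%", MISSING],
    "GPT-5.5": ["80.6%", "94.0%", MISSING],
}


def column_leaders(columns, rows):
    """Return {benchmark: (model, score)} using only values from that benchmark's column."""
    leaders = {}
    for i, name in enumerate(columns):
        scored = []
        for model, cells in rows.items():
            value = parse_cell(cells[i])
            if value is not None:  # skip models with no result on this benchmark
                scored.append((value, model))
        if scored:
            best_value, best_model = max(scored)
            leaders[name] = (best_model, best_value)
    return leaders


leaders = column_leaders(columns, rows)
for benchmark, (model, score) in leaders.items():
    print(f"{benchmark}: {model} ({score})")
```

Because each column keeps its own unit (percent resolved, accuracy, Elo), the sketch reports one leader per benchmark and never averages or ranks across columns.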