METR Task-Completion Time Horizons
Measures the length of software/ML-engineering tasks (in human-expert minutes) that an AI agent can complete with 50% reliability, derived from a logistic fit over HCAST, RE-Bench, and SWAA task suites.
Agents · 50% time horizon · Higher is better
| Model | Lab | 50% time horizon (min) | Source model | Status | Date |
|---|---|---|---|---|---|
| Claude Mythos Preview | Anthropic | 1044.8 | — | Official | Apr 7, 2026 |
| Claude Opus 4.6 | Anthropic | 718.8 | — | Official | Feb 5, 2026 |
| Gemini 3.1 Pro Preview | Google DeepMind | 384.1 | — | Official | Feb 19, 2026 |
| GPT-5.2 | OpenAI | 352.2 | — | Official | Dec 11, 2025 |
| GPT-5.3-Codex | OpenAI | 349.5 | — | Official | Feb 5, 2026 |
| GPT-5.4 | OpenAI | 341.7 | — | Official | Mar 5, 2026 |
| Claude Opus 4.5 | Anthropic | 293.0 | — | Official | Nov 24, 2025 |
| Gemini 3 Pro | Google DeepMind | 224.3 | — | Official | Nov 18, 2025 |
| GPT-5 | OpenAI | 203.0 | — | Official | Aug 7, 2025 |
| Claude Sonnet 4.5 | Anthropic | 122 | — | Official | Sep 29, 2025 |
| o3 | OpenAI | 119.7 | — | Official | Apr 16, 2025 |
| Grok 4 | xAI | 109 | — | Official | Jul 9, 2025 |
| Claude Opus 4.1 | Anthropic | 100.5 | — | Official | Aug 5, 2025 |
| Claude Opus 4 | Anthropic | 100.4 | — | Official | May 22, 2025 |
| Claude Sonnet 4 | Anthropic | 75 | — | Official | May 22, 2025 |
| Claude 3.7 Sonnet | Anthropic | 60.4 | — | Official | Feb 24, 2025 |
| Claude 3.5 Sonnet | Anthropic | 11.4 | — | Official | Jun 20, 2024 |
| GPT-4o | OpenAI | 7.0 | — | Official | May 13, 2024 |
Each row reports the model’s 50% time horizon, in minutes of human-expert task time, on METR Task-Completion Time Horizons.
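The benchmark description says the horizon comes from a logistic fit of task success against task length. The sketch below illustrates that idea under stated assumptions and is not METR's actual pipeline: it regresses success on log2 of the human-expert minutes, fits with a hand-written Newton iteration, and uses entirely made-up attempt data. The names `fit_logistic` and `time_horizon` are hypothetical helpers introduced only for this example.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(log2_minutes, outcomes, iterations=25):
    """Fit P(success) = sigmoid(a + b * log2(minutes)) by Newton's method.

    Assumes the data are not perfectly separable, so the maximum-likelihood
    estimate is finite. Returns (a, b).
    """
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for x, y in zip(log2_minutes, outcomes):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            # Gradient of the log-likelihood.
            g_a += y - p
            g_b += (y - p) * x
            # Negative Hessian of the log-likelihood.
            h_aa += w
            h_ab += w * x
            h_bb += w * x * x
        det = h_aa * h_bb - h_ab * h_ab
        # Newton update: solve the 2x2 system in closed form.
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b


def time_horizon(a, b, reliability=0.5):
    """Task length in minutes at which the fitted success rate equals `reliability`."""
    logit = math.log(reliability / (1.0 - reliability))
    return 2.0 ** ((logit - a) / b)


# Made-up data: tasks of 1, 2, 4, ..., 256 human-expert minutes,
# four attempts each, with this many successes per task length.
successes_out_of_4 = [4, 4, 4, 3, 2, 1, 0, 0, 0]
xs, ys = [], []
for log2_len, wins in enumerate(successes_out_of_4):
    xs += [float(log2_len)] * 4
    ys += [1.0] * wins + [0.0] * (4 - wins)

a, b = fit_logistic(xs, ys)
print(f"50% time horizon: {time_horizon(a, b):.1f} min")
```

Because the invented success counts are symmetric around the 16-minute tasks, the fitted curve crosses 50% at 16 minutes. Passing a different `reliability` to `time_horizon` reads a stricter or looser horizon off the same fitted curve.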