evals.report

METR Task-Completion Time Horizons

Measures the length of software/ML-engineering tasks (in human-expert minutes) that an AI agent can complete with 50% reliability, derived from a logistic fit over HCAST, RE-Bench, and SWAA task suites.
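As a rough illustration of how a 50% time horizon can be read off a logistic fit, the sketch below models success probability as a logistic function of log2(human-expert minutes) and solves for the task length at which the fitted probability is 0.5. The run data, the function names (`fit_logistic`, `time_horizon_minutes`), and the plain Newton's-method fit are assumptions made for illustration. This is not METR's code, and the published methodology may differ, for example in how tasks are weighted.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(xs, ys, iters=50, ridge=1e-6):
    """Fit P(y=1) = sigmoid(a + b*x) by Newton's method on two parameters."""
    a, b = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = 0.0            # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0    # negative Hessian (X^T W X)
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        h00 += ridge             # tiny damping keeps the 2x2 system invertible
        h11 += ridge
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a += da
        b += db
        if abs(da) + abs(db) < 1e-10:
            break
    return a, b

def time_horizon_minutes(runs, reliability=0.5):
    """Task length (minutes) at which fitted success probability equals `reliability`."""
    xs = [math.log2(minutes) for minutes, _ in runs]
    ys = [1.0 if ok else 0.0 for _, ok in runs]
    a, b = fit_logistic(xs, ys)
    logit = math.log(reliability / (1.0 - reliability))
    return 2.0 ** ((logit - a) / b), (a, b)

# Hypothetical runs: (human-expert minutes for the task, did the agent succeed?)
runs = [
    (1, True), (1, True), (2, True), (2, True), (4, True), (4, True),
    (8, True), (8, False), (16, True), (16, True), (32, True), (32, False),
    (64, False), (64, True), (128, False), (128, False),
    (256, False), (256, False), (512, False), (512, False),
]

horizon, (a, b) = time_horizon_minutes(runs)
print(f"50% time horizon: {horizon:.1f} min (intercept={a:.2f}, slope={b:.2f})")
```

Fitting against log2 of task length means the slope describes how quickly success falls off per doubling of task duration. Passing a different `reliability` (for example 0.8) gives the horizon at that stricter success rate under the same fit.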

Agents | 50% time horizon | Higher is better
Model                   Lab              Score       Source model  Status    Date
Claude Mythos Preview   Anthropic        1044.8 min                Official  Apr 7, 2026
Claude Opus 4.6         Anthropic        718.8 min                 Official  Feb 5, 2026
Gemini 3.1 Pro Preview  Google DeepMind  384.1 min                 Official  Feb 19, 2026
GPT-5.2                 OpenAI           352.2 min                 Official  Dec 11, 2025
GPT-5.3-Codex           OpenAI           349.5 min                 Official  Feb 5, 2026
GPT-5.4                 OpenAI           341.7 min                 Official  Mar 5, 2026
Claude Opus 4.5         Anthropic        293.0 min                 Official  Nov 24, 2025
Gemini 3 Pro            Google DeepMind  224.3 min                 Official  Nov 18, 2025
GPT-5                   OpenAI           203.0 min                 Official  Aug 7, 2025
Claude Sonnet 4.5       Anthropic        122 min                   Official  Sep 29, 2025
o3                      OpenAI           119.7 min                 Official  Apr 16, 2025
Grok 4                  xAI              109 min                   Official  Jul 9, 2025
Claude Opus 4.1         Anthropic        100.5 min                 Official  Aug 5, 2025
Claude Opus 4           Anthropic        100.4 min                 Official  May 22, 2025
Claude Sonnet 4         Anthropic        75 min                    Official  May 22, 2025
Claude 3.7 Sonnet       Anthropic        60.4 min                  Official  Feb 24, 2025
Claude 3.5 Sonnet       Anthropic        11.4 min                  Official  Jun 20, 2024
GPT-4o                  OpenAI           7.0 min                   Official  May 13, 2024

Each row reports the model’s 50% time horizon on METR Task-Completion Time Horizons.