Question 1

What is METR Task-Completion Time Horizons?

Accepted Answer

METR Task-Completion Time Horizons measures the length of software and ML-engineering tasks, in human-expert minutes, that an AI agent can complete with 50% reliability. The figure is derived from a logistic fit over the HCAST, RE-Bench, and SWAA task suites. It is an agent benchmark, and its metric is the 50% time horizon.
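To make the "logistic fit" concrete, here is a minimal sketch of how a 50% time horizon can be read off a fitted success curve. It is not METR's actual pipeline: the task lengths and outcomes are made-up toy data, the choice of log2(minutes) as the predictor is an assumption, and the helper names `fit_logistic` and `time_horizon` are hypothetical.

```python
import math

def fit_logistic(xs, ys, iters=25):
    """Fit P(success) = sigmoid(a + b*x) with Newton's method (pure Python)."""
    a, b = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            w = p * (1.0 - p)
            g0 += y - p            # gradient w.r.t. intercept
            g1 += (y - p) * x      # gradient w.r.t. slope
            h00 += w               # entries of X^T W X
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (X^T W X) * delta = gradient
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

def time_horizon(a, b, reliability=0.5):
    """Task length (minutes) at which predicted success equals `reliability`."""
    logit = math.log(reliability / (1.0 - reliability))
    return 2.0 ** ((logit - a) / b)

# Toy data (hypothetical): human-expert task length in minutes, agent success (1) or failure (0).
minutes = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
success = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]

xs = [math.log2(m) for m in minutes]  # assumed predictor: log2 of task length
a, b = fit_logistic(xs, success)
horizon = time_horizon(a, b)
print(f"50% time horizon: {horizon:.1f} min")
```

On this toy data the fitted curve crosses 50% at about 22.6 minutes: the agent usually succeeds on shorter tasks and usually fails on longer ones. Passing a different `reliability` (for example 0.8) gives the horizon at a stricter success threshold.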

Question 2

What does 50% time horizon mean on METR Task-Completion Time Horizons?

Accepted Answer

METR Task-Completion Time Horizons reports a 50% time horizon in minutes: the human-expert task length at which the agent succeeds half the time. Higher is better. Scores are shown only within METR Task-Completion Time Horizons and are never averaged with other benchmarks.

Question 3

What is the top reported METR Task-Completion Time Horizons score?

Accepted Answer

Claude Mythos Preview has the top reported score on METR Task-Completion Time Horizons: 1044.8 min (50% time horizon).

Question 4

Why do METR Task-Completion Time Horizons scores differ across runs?

Accepted Answer

The harness, scaffold, reasoning effort, and prompt setup all affect results, so two runs of the same model can produce different scores. evals.report keeps each score with its run context so the differences stay visible.
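One way to picture "keeping each score with its run context" is a record that stores the score and its setup together, so two runs of the same model never collapse into a single number. This is only an illustrative sketch: the class names, field names, and values below are hypothetical and are not evals.report's actual schema or data.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class RunContext:
    # The four factors named above; the field names are assumptions.
    harness: str
    scaffold: str
    reasoning_effort: str
    prompt_setup: str

@dataclass(frozen=True)
class ScoreRecord:
    model: str
    horizon_min: float   # 50% time horizon in minutes
    context: RunContext

def runs_for(model, records):
    """Return every recorded run for a model, without merging them."""
    return [r for r in records if r.model == model]

def differing_fields(a, b):
    """Names of the context fields on which two runs differ."""
    return [f.name for f in fields(RunContext)
            if getattr(a, f.name) != getattr(b, f.name)]

# Placeholder records (not real results).
records = [
    ScoreRecord("example-model", 100.0,
                RunContext("harness-a", "scaffold-x", "high", "default")),
    ScoreRecord("example-model", 120.0,
                RunContext("harness-b", "scaffold-x", "high", "default")),
]

runs = runs_for("example-model", records)
print(differing_fields(runs[0].context, runs[1].context))
```

Here the two placeholder runs differ only in the harness, which is the kind of detail that stays visible when scores are stored per run rather than per model.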

Question 5

Does evals.report rank models across benchmarks?

Accepted Answer

No. METR Task-Completion Time Horizons scores are shown within their own metric; evals.report never combines benchmarks into a composite ranking or a single "best model".

Model	Lab	Score (50% time horizon)	Source model	Status	Date
Claude Mythos Preview	Anthropic	1044.8 min	—	Official	Apr 7, 2026
Claude Opus 4.6	Anthropic	718.8 min	—	Official	Feb 5, 2026
Gemini 3.1 Pro Preview	Google DeepMind	384.1 min	—	Official	Feb 19, 2026
GPT-5.2	OpenAI	352.2 min	—	Official	Dec 11, 2025
GPT-5.3-Codex	OpenAI	349.5 min	—	Official	Feb 5, 2026
GPT-5.4	OpenAI	341.7 min	—	Official	Mar 5, 2026
Claude Opus 4.5	Anthropic	293.0 min	—	Official	Nov 24, 2025
Gemini 3 Pro	Google DeepMind	224.3 min	—	Official	Nov 18, 2025
GPT-5	OpenAI	203.0 min	—	Official	Aug 7, 2025
Claude Sonnet 4.5	Anthropic	122 min	—	Official	Sep 29, 2025
o3	OpenAI	119.7 min	—	Official	Apr 16, 2025
Grok 4	xAI	109 min	—	Official	Jul 9, 2025
Claude Opus 4.1	Anthropic	100.5 min	—	Official	Aug 5, 2025
Claude Opus 4	Anthropic	100.4 min	—	Official	May 22, 2025
Claude Sonnet 4	Anthropic	75 min	—	Official	May 22, 2025
Claude 3.7 Sonnet	Anthropic	60.4 min	—	Official	Feb 24, 2025
Claude 3.5 Sonnet	Anthropic	11.4 min	—	Official	Jun 20, 2024
GPT-4o	OpenAI	7.0 min	—	Official	May 13, 2024