Question 1

What is METR Task-Completion Time Horizons?

Accepted Answer

It measures the length of software and ML-engineering tasks, in human-expert minutes, that an AI agent can complete with 50% reliability. The value is derived from a logistic fit over the HCAST, RE-Bench, and SWAA task suites. It is an agent benchmark whose metric is the 50% time horizon.
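
As a rough illustration of how a 50% time horizon can fall out of a logistic fit, here is a minimal sketch. It assumes success is modeled as sigmoid(a + b * log2(human minutes)); the answer above only says "logistic fit", so the log2 scale, the Newton solver, and all names and data below are assumptions for illustration, not METR's actual pipeline.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, iterations=25):
    """Fit P(success) = sigmoid(a + b*x) by Newton's method; returns (a, b)."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            # Gradient of the log-likelihood
            g0 += y - p
            g1 += (y - p) * x
            # Entries of the (negated) Hessian, X^T W X
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system for the Newton step and apply it
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

def time_horizon_50(minutes, successes):
    """Task length (minutes) at which the fitted success probability is 50%."""
    xs = [math.log2(m) for m in minutes]
    a, b = fit_logistic(xs, successes)
    # sigmoid(a + b*x) = 0.5 exactly when a + b*x = 0, so x = -a/b
    return 2.0 ** (-a / b)

# Made-up toy data: human-expert minutes per task and whether the agent succeeded.
minutes = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
successes = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]

horizon = time_horizon_50(minutes, successes)
print(f"50% time horizon: {horizon:.1f} min")
```

The horizon is simply the task length where the fitted curve crosses 0.5, which is why a single number in minutes can summarize results across tasks of very different lengths.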

Question 2

What does 50% time horizon mean on METR Task-Completion Time Horizons?

Accepted Answer

METR Task-Completion Time Horizons reports the 50% time horizon in minutes; higher is better. Scores are shown only within METR Task-Completion Time Horizons and are never averaged with other benchmarks.

Question 3

What is the top reported METR Task-Completion Time Horizons score?

Accepted Answer

Claude Mythos Preview has the top reported score on METR Task-Completion Time Horizons: 1044.8 min (50% time horizon).

Question 4

Why do METR Task-Completion Time Horizons scores differ across runs?

Accepted Answer

Harness, scaffold, reasoning effort, and prompt setup change results, so two runs of the same model can differ. evals.report keeps each score with its run context so the differences stay visible.
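
One way to picture "a score kept with its run context" is a record that carries the setup alongside the value. The sketch below is hypothetical: the class names, field names, and placeholder values are assumptions for illustration and do not describe evals.report's actual schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunContext:
    # Hypothetical fields mirroring the factors named in the answer above
    harness: str
    scaffold: str
    reasoning_effort: str
    prompt_setup: str

@dataclass(frozen=True)
class ScoreRecord:
    benchmark: str
    model: str
    metric: str
    value: float
    context: RunContext

# Placeholder records: same model, same benchmark, different run setups.
records = [
    ScoreRecord("benchmark-a", "model-x", "50% time horizon (min)", 100.0,
                RunContext("harness-1", "scaffold-1", "low", "prompt-v1")),
    ScoreRecord("benchmark-a", "model-x", "50% time horizon (min)", 130.0,
                RunContext("harness-1", "scaffold-2", "high", "prompt-v1")),
]

# Both runs stay visible as separate entries rather than being merged.
runs_for_model = [r for r in records if r.model == "model-x"]
distinct_contexts = {r.context for r in runs_for_model}

for r in runs_for_model:
    print(r.value, r.context.scaffold, r.context.reasoning_effort)
```

Because the context is part of the record, two scores for the same model are never collapsed into one number, and the reader can see which setup produced which result.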

Question 5

Does evals.report rank models across benchmarks?

Accepted Answer

No. METR Task-Completion Time Horizons scores are shown within their own metric; evals.report never combines benchmarks into a composite ranking or a single "best model".
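
The "no composite ranking" rule can be sketched as grouping scores by benchmark and metric, then sorting only inside each group. Everything below is hypothetical (row layout, names, placeholder values), and it assumes higher is better for every metric shown.

```python
from collections import defaultdict

# Hypothetical rows: (benchmark, metric, model, value). Values are placeholders.
rows = [
    ("benchmark-a", "50% time horizon (min)", "model-x", 120.0),
    ("benchmark-a", "50% time horizon (min)", "model-y", 90.0),
    ("benchmark-b", "accuracy (%)", "model-x", 40.0),
    ("benchmark-b", "accuracy (%)", "model-y", 55.0),
]

def per_benchmark_tables(rows):
    """Group rows by (benchmark, metric) and sort each group, highest first."""
    tables = defaultdict(list)
    for benchmark, metric, model, value in rows:
        tables[(benchmark, metric)].append((model, value))
    for entries in tables.values():
        entries.sort(key=lambda entry: entry[1], reverse=True)
    return dict(tables)

tables = per_benchmark_tables(rows)
for (benchmark, metric), entries in tables.items():
    print(benchmark, metric, entries)
```

No step averages or sums values across groups, so a model that leads one table and trails another is shown that way in each, with no overall winner computed.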