evals.report

METR Task-Completion Time Horizons

Measures the length of software/ML-engineering tasks (in human-expert minutes) that an AI agent can complete with 50% reliability, derived from a logistic fit over HCAST, RE-Bench, and SWAA task suites.

Agents · 50% time horizon · Higher is better

What this benchmark measures

The 50% time horizon is the task length, in human-expert minutes, at which an AI agent completes software and ML-engineering tasks with 50% reliability. It is derived from a logistic fit over the HCAST, RE-Bench, and SWAA task suites.
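To make the idea of a horizon "derived from a logistic fit" concrete, here is a minimal sketch. It assumes success probability is modeled as a logistic function of log2(human-expert minutes) and reads off the task length where the fitted curve crosses 50%. The function names and the toy data are hypothetical, and this is not METR's actual pipeline, which may weight tasks, regularize, and bootstrap confidence intervals.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def log_likelihood(a, b, xs, ys):
    """Bernoulli log-likelihood of p = sigmoid(a + b * x)."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = a + b * x
        softplus = max(z, 0.0) + math.log1p(math.exp(-abs(z)))
        total += y * z - softplus
    return total


def fit_time_horizon(minutes, successes, iterations=50):
    """Fit a two-parameter logistic model and return the 50% horizon in minutes.

    Illustrative only: plain Newton steps with step halving, no task
    weighting and no regularization.
    """
    xs = [math.log2(m) for m in minutes]
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for x, y in zip(xs, successes):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g_a += y - p
            g_b += (y - p) * x
            h_aa += w
            h_ab += w * x
            h_bb += w * x * x
        det = h_aa * h_bb - h_ab * h_ab
        # Newton step: solve the 2x2 system H * step = gradient.
        step_a = (h_bb * g_a - h_ab * g_b) / det
        step_b = (h_aa * g_b - h_ab * g_a) / det
        # Halve the step until the log-likelihood does not decrease.
        scale = 1.0
        current = log_likelihood(a, b, xs, successes)
        while (scale > 1e-8 and
               log_likelihood(a + scale * step_a, b + scale * step_b,
                              xs, successes) < current):
            scale /= 2.0
        a += scale * step_a
        b += scale * step_b
        if abs(scale * step_a) + abs(scale * step_b) < 1e-10:
            break
    # p = 0.5 where a + b * log2(minutes) = 0.
    return 2.0 ** (-a / b)


# Made-up outcomes for one hypothetical agent: (task length, 1 = success).
toy_minutes = [1, 2, 4, 8, 15, 30, 60, 120, 240, 480]
toy_successes = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0]

horizon = fit_time_horizon(toy_minutes, toy_successes)
print(f"Estimated 50% time horizon: {horizon:.1f} human-expert minutes")
```

The toy outcomes deliberately overlap (a failure at 30 minutes, a success at 60), because a perfectly separable dataset has no finite maximum-likelihood fit.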

Rows on this page are sourced from public benchmark artifacts, leaderboard exports, or source-linked model reports. Each row keeps its benchmark version, source model name, and any available run details attached to the score.

The metric shown here is the 50% time horizon. Interpret it only within METR Task-Completion Time Horizons; it is not part of any site-wide ranking.

No composite ranking
evals.report never combines benchmarks. The 50% time horizon on METR Task-Completion Time Horizons is its own number; don’t average it with other metrics.