evals.report

METR Task-Completion Time Horizons

Measures the length of software and ML-engineering tasks, in minutes of human-expert time, that an AI agent can complete with 50% reliability. The horizon is derived from a logistic fit over the HCAST, RE-Bench, and SWAA task suites.
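As a rough illustration of how a 50% time horizon can be read off a logistic fit, the sketch below assumes success probability is modeled as a logistic function of the base-2 logarithm of human-expert task time. The task durations, outcomes, and function names are hypothetical and not drawn from HCAST, RE-Bench, or SWAA. The fit uses plain gradient ascent for self-containment, and it is not METR's actual pipeline, which may apply weighting or other refinements not shown here.

```python
import math

def sigmoid(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_time_horizon(minutes, successes, iters=5000):
    """Fit P(success) = sigmoid(a + b * (log2(t) - x_mean)) by gradient ascent.

    Returns (a, b, x_mean, horizon), where horizon is the task length in
    minutes at which the fitted success probability equals 0.5.
    """
    x = [math.log2(m) for m in minutes]
    x_mean = sum(x) / len(x)
    xc = [xi - x_mean for xi in x]  # centering keeps the fit well conditioned
    sxx = sum(v * v for v in xc)
    # Step size 1/L, where L bounds the curvature of the log-likelihood.
    step = 4.0 / max(len(x), sxx)
    a = b = 0.0
    for _ in range(iters):
        resid = [y - sigmoid(a + b * v) for v, y in zip(xc, successes)]
        a += step * sum(resid)
        b += step * sum(r * v for r, v in zip(resid, xc))
    # Solve a + b * (log2(t) - x_mean) = 0 for t.
    horizon = 2 ** (x_mean - a / b)
    return a, b, x_mean, horizon

# Hypothetical data: human-expert minutes per task and agent outcome (1 = success).
task_minutes = [1, 2, 4, 8, 15, 30, 60, 120, 240, 480]
agent_success = [1, 1, 1, 0, 1, 1, 0, 1, 0, 0]

a, b, x_mean, horizon = fit_time_horizon(task_minutes, agent_success)
print(f"slope per doubling of task length: {b:.3f}")
print(f"estimated 50% time horizon: {horizon:.1f} minutes")
```

A negative slope means success becomes less likely as tasks get longer, and the horizon is the task length at which the fitted curve crosses 50%. With this few tasks the estimate is very noisy; a real analysis would also report uncertainty.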

Agents | 50% time horizon | Higher is better

No run guide for this benchmark yet.