METR Task-Completion Time Horizons
Measures the length of software and ML-engineering tasks, in minutes of human-expert working time, that an AI agent can complete with 50% reliability. The horizon is derived from a logistic fit over the HCAST, RE-Bench, and SWAA task suites.
Category: Agents | Metric: 50% time horizon | Higher is better
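To make the "50% time horizon" concrete, here is a minimal sketch of how such a number could be read off a logistic fit. It is not METR's pipeline: the toy task data, the function names `fit_logistic` and `time_horizon_minutes`, the choice of log2(minutes) as the predictor, and the plain unweighted Newton fit are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(xs, ys, iters=25):
    """Fit P(success) = sigmoid(a + b * x) by Newton's method.

    Assumes the data are not perfectly separable, so a finite
    maximum-likelihood solution exists.
    """
    a, b = 0.0, 0.0
    for _ in range(iters):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g_a += y - p            # gradient of the log-likelihood
            g_b += (y - p) * x
            h_aa += w               # entries of the negated Hessian
            h_ab += w * x
            h_bb += w * x * x
        det = h_aa * h_bb - h_ab * h_ab
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b


def time_horizon_minutes(runs, reliability=0.5):
    """Task length (human-expert minutes) at the given success probability.

    `runs` is a list of (human_minutes, succeeded) pairs, one per agent run.
    """
    xs = [math.log2(minutes) for minutes, _ in runs]
    ys = [1.0 if ok else 0.0 for _, ok in runs]
    a, b = fit_logistic(xs, ys)
    # Solve sigmoid(a + b * x) = reliability for x, then undo the log2.
    logit = math.log(reliability / (1.0 - reliability))
    return 2.0 ** ((logit - a) / b)


# Hypothetical agent runs: (human-expert minutes, agent succeeded).
toy_runs = [
    (2, True), (2, True), (2, True),
    (8, True), (8, True), (8, False),
    (16, True), (16, False),
    (32, True), (32, False), (32, False),
    (128, False), (128, False), (128, False),
]

horizon = time_horizon_minutes(toy_runs)
print(f"50% time horizon: {horizon:.1f} human-expert minutes")
```

The toy data are deliberately symmetric around 16 minutes on the log scale, so the fitted curve crosses 50% at 16 minutes. Real task suites would need per-task aggregation and some treatment of uncertainty, which this sketch leaves out.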
No run guide for this benchmark yet.