Remote Labor Index
The Remote Labor Index (RLI), from CAIS and Scale Labs, measures how often AI agents can complete real, economically valuable freelance projects (3D & CAD, architecture, graphic design, video, audio, data analysis, web apps, and more) at a quality a paying client would accept. Each of the 240 projects has a real client brief, input files, and a gold-standard deliverable from a paid professional; every AI deliverable is judged by human evaluators. The headline automation rate is the share of projects where the AI's work is judged at least as good as the human's.
What this benchmark measures
Rows on this page are sourced from public benchmark artifacts, leaderboard exports, or source-linked model reports. Each row keeps benchmark version, source model name, and available run details attached to the score.
The metric shown here is automation rate. It should be interpreted within Remote Labor Index, not compared as part of a site-wide ranking.
What to be careful about
Scores are human-judged, not automated: CAIS shows that an LLM judge overstates the newest models' scores by roughly 2.5-3x. Each entry pairs a model with a strong industry agent scaffold plus computer use. Claude Fable 5's 16.1% covers 218 of the 240 projects; if the 22 projects not covered are counted as failures, the worst case is 14.6%.
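The worst-case figure can be reproduced with a little arithmetic. The sketch below is illustrative only: the function name `worst_case_rate` is hypothetical, and the success count of 35 is inferred from 16.1% of 218 rather than taken from a published number. It assumes every project outside the evaluated set counts as a failure.

```python
def worst_case_rate(reported_rate_pct: float, evaluated: int, total: int):
    """Lower bound on the automation rate when only part of the suite was judged.

    Assumption (not from the source): the number of successes is the reported
    rate times the evaluated count, rounded to the nearest whole project, and
    every unevaluated project is treated as a failure.
    """
    successes = round(reported_rate_pct / 100 * evaluated)  # inferred count
    worst_case_pct = 100 * successes / total                # failures fill the gap
    return successes, worst_case_pct


# Figures quoted on this page: 16.1% over 218 of 240 projects.
successes, worst = worst_case_rate(16.1, evaluated=218, total=240)
print(successes, round(worst, 1))  # 35 14.6
```

Under these assumptions the inferred 35 accepted deliverables give 35/218 ≈ 16.1% on the evaluated subset and 35/240 ≈ 14.6% on the full suite, matching the two numbers quoted above.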
Frequently asked
What is Remote Labor Index?
Remote Labor Index (RLI) is an agents benchmark from CAIS and Scale Labs. It measures how often AI agents can complete real, economically valuable freelance projects (3D & CAD, architecture, graphic design, video, audio, data analysis, web apps, and more) at a quality a paying client would accept. Results are reported as an automation rate: the share of the 240 projects where human evaluators judge the AI's work at least as good as the paid professional's.
What does automation rate mean on Remote Labor Index?
Remote Labor Index reports automation rate (%); higher is better. Scores are shown only within Remote Labor Index and are never averaged with other benchmarks.
What is the top reported Remote Labor Index score?
Claude Fable 5 has the top reported score on Remote Labor Index: 16.1% (automation rate).
Why do Remote Labor Index scores differ across runs?
Harness, scaffold, reasoning effort, and prompt setup all change results, so two runs of the same model can differ. evals.report keeps each score with its run context so the differences stay visible.
Does evals.report rank models across benchmarks?
No. Remote Labor Index scores are shown within their own metric; evals.report never combines benchmarks into a composite ranking or a single "best model".