evals.report

HELM

Reproducible benchmark output and schemas across many scenarios.

Next, GCS bucket, Review needed, Run guide ready, Machine-readable
Official source

Source detail

Score source

The official documentation states that raw results are published in the public GCS bucket crfm-helm-public.
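A minimal sketch of how a public bucket like this could be browsed without credentials, assuming the standard Google Cloud Storage JSON API listing endpoint. Only the bucket name comes from the text above; the helper names and the empty default prefix are illustrative assumptions, and the block only builds the URL and reads an already-decoded response, so it makes no network call itself.

```python
from urllib.parse import urlencode

BUCKET = "crfm-helm-public"  # bucket name as given in the source text


def listing_url(bucket: str, prefix: str = "", delimiter: str = "/") -> str:
    """Build a GCS JSON API URL that lists objects under `prefix`.

    With a delimiter set, the response groups deeper paths into a
    "prefixes" list, which behaves like a directory listing.
    """
    query = urlencode({"prefix": prefix, "delimiter": delimiter})
    return f"https://storage.googleapis.com/storage/v1/b/{bucket}/o?{query}"


def child_prefixes(payload: dict) -> list:
    """Return the directory-like prefixes from a decoded listing response."""
    return list(payload.get("prefixes", []))
```

Fetching `listing_url(BUCKET)` with any HTTP client and passing the decoded JSON to `child_prefixes` would give the top-level layout of the bucket; the actual prefix names are not assumed here.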

Run guide

The repository documents how to download the raw results and reproduce the leaderboard.

How it can be used

Index the releases and suites first, then display only the selected scenarios.
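One way to picture this two-step approach is a nested index keyed by release and suite, queried afterwards for a chosen subset of scenarios. The sketch below is illustrative only: `RunRecord`, the field names, and every release, suite, and scenario label are hypothetical placeholders, not HELM's actual schema or identifiers.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    release: str   # hypothetical release label
    suite: str     # hypothetical suite label
    scenario: str  # hypothetical scenario name


# Placeholder records, not real HELM data.
EXAMPLE_RECORDS = [
    RunRecord("release-a", "suite-1", "scenario_x"),
    RunRecord("release-a", "suite-1", "scenario_y"),
    RunRecord("release-b", "suite-2", "scenario_x"),
]


def build_index(records):
    """Step 1: index scenario names by release, then by suite."""
    index = defaultdict(lambda: defaultdict(set))
    for record in records:
        index[record.release][record.suite].add(record.scenario)
    return index


def select_scenarios(index, release, suite, wanted):
    """Step 2: return only the requested scenarios present in that suite."""
    available = index.get(release, {}).get(suite, set())
    return sorted(available & set(wanted))
```

Building the index once keeps the release and suite listing cheap, and scenario-level detail is pulled in only for the names passed as `wanted`.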

Caveat

HELM is large and reports multiple metrics per scenario. Do not collapse scenario-level metrics into a single score.
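As a rough illustration of this caveat, the sketch below keeps every (scenario, metric) pair as its own row instead of averaging across metrics. The row layout, metric names, and values are hypothetical placeholders chosen for the example, not HELM results or HELM's output format.

```python
# Placeholder rows of (scenario, metric, value); not real results.
EXAMPLE_ROWS = [
    ("scenario_x", "accuracy", 0.5),
    ("scenario_x", "calibration_error", 0.1),
    ("scenario_y", "accuracy", 0.75),
]


def metrics_table(rows):
    """Group values by scenario and metric without aggregating them."""
    table = {}
    for scenario, metric, value in rows:
        table.setdefault(scenario, {})[metric] = value
    return table


def format_table(table):
    """Render one tab-separated line per (scenario, metric) pair."""
    lines = []
    for scenario in sorted(table):
        for metric in sorted(table[scenario]):
            lines.append(f"{scenario}\t{metric}\t{table[scenario][metric]}")
    return lines
```

Each metric stays visible next to its scenario, so a reader can see that two scenarios differ on one metric without that difference being hidden by an average.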
