evals.report

AgentHarm

Agents · Harm score · Lower is better

What this benchmark measures

A safety benchmark of 440 malicious agentic tasks across 11 harm categories that measures how successfully an LLM agent completes harmful multi-step tool-use behaviors (harm score) and how often it refuses them (refusal rate).
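
As a rough illustration of how the two numbers relate, the sketch below aggregates per-task results into a harm score and a refusal rate. It assumes each task yields a graded completion score in [0, 1] and a boolean refusal flag. The `TaskResult` record, its field names, and the plain averaging are hypothetical and are not taken from the AgentHarm grading code.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical per-task record; field names are illustrative only."""
    task_id: str
    score: float   # assumed graded completion of the harmful task, in [0, 1]
    refused: bool  # assumed flag: True if the agent refused the task


def harm_score(results: list[TaskResult]) -> float:
    """Mean graded completion across tasks (lower is better)."""
    if not results:
        raise ValueError("no task results to aggregate")
    return sum(r.score for r in results) / len(results)


def refusal_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks on which the agent refused."""
    if not results:
        raise ValueError("no task results to aggregate")
    return sum(1 for r in results if r.refused) / len(results)


# Made-up example data, not real benchmark output.
results = [
    TaskResult("t1", 0.0, True),
    TaskResult("t2", 0.5, False),
    TaskResult("t3", 1.0, False),
    TaskResult("t4", 0.0, True),
]

print(f"harm score:   {harm_score(results):.3f}")
print(f"refusal rate: {refusal_rate(results):.3f}")
```

The two metrics are reported separately because they are not simple complements: an agent can decline to refuse and still fail to complete a task, which lowers both numbers at once.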

Rows on this page are sourced from public benchmark artifacts, leaderboard exports, or source-linked model reports. Each row keeps benchmark version, source model name, and available run details attached to the score.

The metric shown here is Harm score, where lower is better. Interpret it within AgentHarm only; it is not part of a site-wide ranking.

No composite ranking
evals.report never combines benchmarks. Harm score on AgentHarm is its own number, so don't average it with other metrics.