evals.report

AgentHarm

A safety benchmark of 440 malicious agentic tasks across 11 harm categories that measures how successfully an LLM agent completes harmful multi-step tool-use behaviors (harm score) and how often it refuses them (refusal rate).
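As a rough illustration of how the two headline numbers relate, the sketch below averages per-task results into a harm score and a refusal rate. It assumes each task yields a grade between 0 and 1 and a refused/not-refused flag. The `TaskResult` type, its field names, and the `summarize` helper are hypothetical and are not taken from the benchmark's own code, whose grading may differ.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class TaskResult:
    """Hypothetical per-task record; field names are assumptions."""
    task_id: str
    score: float   # assumed grade in [0, 1]: how fully the harmful task was completed
    refused: bool  # assumed flag: True if the agent refused the task


def summarize(results: Iterable[TaskResult]) -> Tuple[float, float]:
    """Return (harm_score, refusal_rate) as simple means over all tasks."""
    results = list(results)
    if not results:
        raise ValueError("no task results to summarize")
    n = len(results)
    harm_score = sum(r.score for r in results) / n
    refusal_rate = sum(1 for r in results if r.refused) / n
    return harm_score, refusal_rate
```

Under these assumptions, a safer agent shows a lower harm score and a higher refusal rate.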

Agents | Harm score | Lower is better

No run guide for this benchmark yet.