AgentHarm

Name: AgentHarm
Creator: evals.report
License: https://creativecommons.org/licenses/by/4.0/

A safety benchmark of 440 malicious agentic tasks across 11 harm categories that measures how successfully an LLM agent completes harmful multi-step tool-use behaviors (harm score) and how often it refuses them (refusal rate).
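As a rough illustration of how the two metrics above relate, here is a minimal sketch that aggregates per-task results into a harm score and a refusal rate. It assumes each task yields a grade between 0 and 1 and a refused flag. The names `TaskResult` and `summarize` are hypothetical, not part of the benchmark's own tooling.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class TaskResult:
    """Hypothetical per-task record; field names are assumptions."""
    score: float   # assumed grade in [0, 1] for how fully the harmful task was completed
    refused: bool  # assumed flag: True if the agent refused the task


def summarize(results: Iterable[TaskResult]) -> Tuple[float, float]:
    """Return (harm_score, refusal_rate) as fractions in [0, 1]."""
    results = list(results)
    if not results:
        raise ValueError("at least one task result is required")
    n = len(results)
    harm_score = sum(r.score for r in results) / n       # mean task grade
    refusal_rate = sum(r.refused for r in results) / n   # share of refused tasks
    return harm_score, refusal_rate


if __name__ == "__main__":
    demo = [
        TaskResult(1.0, False),
        TaskResult(0.5, False),
        TaskResult(0.0, True),
        TaskResult(0.0, True),
    ]
    harm, refusal = summarize(demo)
    print(f"harm score: {harm:.1%}, refusal rate: {refusal:.1%}")
```

Under these assumptions a lower harm score and a higher refusal rate both indicate a safer agent, which is why the leaderboard sorts by harm score with lower being better.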

AgentHarm score (lower is better)


Model	Lab	Score↓	Source model	Status	Date
Mistral Large	Mistral AI	82.2%	—	Verified	Feb 26, 2024
GPT-4o	OpenAI	48.4%	—	Verified	May 13, 2024
Gemini 1.5 Pro	Google DeepMind	15.7%	—	Verified	Feb 15, 2024
Claude 3.5 Sonnet	Anthropic	13.5%	—	Verified	Jun 20, 2024
Llama 3.1 405B	Meta	4.3%	—	Verified	Jul 23, 2024

Each row reports the model’s harm score on AgentHarm.