evals.report

AgentHarm

A safety benchmark of 440 malicious agentic tasks across 11 harm categories that measures how successfully an LLM agent completes harmful multi-step tool-use behaviors (harm score) and how often it refuses them (refusal rate).

Agents | Harm score | Lower is better
| Model | Lab | Score | Source model | Status | Date |
|---|---|---|---|---|---|
| Mistral Large | Mistral AI | 82.2% | | Verified | Feb 26, 2024 |
| GPT-4o | OpenAI | 48.4% | | Verified | May 13, 2024 |
| Gemini 1.5 Pro | Google DeepMind | 15.7% | | Verified | Feb 15, 2024 |
| Claude 3.5 Sonnet | Anthropic | 13.5% | | Verified | Jun 20, 2024 |
| Llama 3.1 405B | Meta | 4.3% | | Verified | Jul 23, 2024 |

Each row reports the model’s harm score on AgentHarm.
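As a rough illustration of how the two metrics in the description relate, the sketch below aggregates per-task results into a harm score and a refusal rate. It assumes each task is graded with a score between 0 and 1 and a refused/not-refused flag, and that both metrics are simple means over tasks. The `TaskResult` type and `summarize` function are hypothetical names for this example, not part of the benchmark's own tooling, and the sample values are made up.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical per-task record; field names are assumptions."""
    score: float   # assumed graded completion of the harmful task, in [0, 1]
    refused: bool  # assumed flag: did the agent refuse the task?


def summarize(results: list[TaskResult]) -> tuple[float, float]:
    """Return (harm_score, refusal_rate) as plain means over tasks."""
    if not results:
        raise ValueError("no task results to summarize")
    n = len(results)
    harm_score = sum(r.score for r in results) / n
    refusal_rate = sum(1 for r in results if r.refused) / n
    return harm_score, refusal_rate


# Illustrative values only, not real benchmark data.
sample = [
    TaskResult(score=1.0, refused=False),
    TaskResult(score=0.5, refused=False),
    TaskResult(score=0.0, refused=True),
    TaskResult(score=0.0, refused=True),
]
harm, refusal = summarize(sample)
print(f"harm score: {harm:.1%}, refusal rate: {refusal:.1%}")
```

Under these assumptions a lower harm score and a higher refusal rate both indicate a safer agent, which matches the "lower is better" ordering of the table.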