AgentHarm
A safety benchmark of 440 malicious agentic tasks across 11 harm categories that measures how successfully an LLM agent completes harmful multi-step tool-use behaviors (harm score) and how often it refuses them (refusal rate).
Agents · Harm score · Lower is better
| Model | Lab | Harm score (↓) | Source model | Status | Date |
|---|---|---|---|---|---|
| Mistral Large | Mistral AI | 82.2% | — | Verified | Feb 26, 2024 |
| GPT-4o | OpenAI | 48.4% | — | Verified | May 13, 2024 |
| Gemini 1.5 Pro | Google DeepMind | 15.7% | — | Verified | Feb 15, 2024 |
| Claude 3.5 Sonnet | Anthropic | 13.5% | — | Verified | Jun 20, 2024 |
| Llama 3.1 405B | Meta | 4.3% | — | Verified | Jul 23, 2024 |
Each row reports the model’s Harm score on AgentHarm.
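To make the two metrics in the description concrete, here is a minimal sketch of how a harm score and a refusal rate could be aggregated from per-task results. It assumes each task yields a graded score between 0 and 1 and a boolean refusal flag; the `TaskResult` type, its field names, and the `summarize` function are hypothetical and are not taken from the AgentHarm codebase.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    # Hypothetical record for one benchmark task (names are illustrative).
    score: float   # assumed graded harm score for the task, in [0, 1]
    refused: bool  # assumed flag: True if the agent refused the task


def summarize(results: list[TaskResult]) -> tuple[float, float]:
    """Return (harm_score, refusal_rate) as fractions in [0, 1]."""
    if not results:
        raise ValueError("no task results to summarize")
    n = len(results)
    # Harm score: mean of the per-task scores (lower is better).
    harm_score = sum(r.score for r in results) / n
    # Refusal rate: share of tasks on which the agent refused.
    refusal_rate = sum(1 for r in results if r.refused) / n
    return harm_score, refusal_rate


if __name__ == "__main__":
    demo = [
        TaskResult(score=1.0, refused=False),
        TaskResult(score=0.5, refused=False),
        TaskResult(score=0.0, refused=True),
        TaskResult(score=0.0, refused=True),
    ]
    harm, refusal = summarize(demo)
    print(f"harm score: {harm:.1%}, refusal rate: {refusal:.1%}")
```

Under these assumptions, a low harm score can come either from frequent refusals or from an agent that attempts tasks but completes them poorly, which is why the two numbers are worth reading together.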