τ²-bench (Telecom)
A dual-control, multi-turn tool-agent-user benchmark (telecom split) where both the AI agent and a simulated user invoke tools to coordinate and resolve technical-support troubleshooting tasks in a shared, dynamic environment.
Category: Tool use. Metric: pass^1 (higher is better).
| Model | Lab | Score (pass^1) | Source model | Status | Date | Details |
|---|---|---|---|---|---|---|
| Claude Opus 4.6 | Anthropic | 99.3% | — | Unverified | Feb 5, 2026 | Details |
| GPT-5.2 | OpenAI | 98.7% | — | Unverified | Dec 11, 2025 | Details |
| Claude Opus 4.5 | Anthropic | 98.2% | — | Unverified | Nov 24, 2025 | Details |
| Grok 4.3 | xAI | 97.7% | — | Official | Apr 17, 2026 | Details |
| GLM-5.1 | Z.ai | 97.7% | — | Official | Apr 7, 2026 | Details |
| GPT-5 | OpenAI | 96.7% | — | Unverified | Aug 7, 2025 | Details |
| DeepSeek V4 Pro | DeepSeek | 96.2% | — | Official | Apr 24, 2026 | Details |
| Kimi K2.6 | Moonshot AI | 95.9% | — | Official | Apr 20, 2026 | Details |
| Gemini 3.1 Pro Preview | Google DeepMind | 95.6% | — | Official | Feb 19, 2026 | Details |
| Qwen3.5-397B-A17B | Alibaba / Qwen | 95.6% | — | Official | Feb 16, 2026 | Details |
| GPT-5.1 | OpenAI | 95.6% | — | Unverified | Nov 12, 2025 | Details |
| Gemini 3.5 Flash | Google DeepMind | 95.3% | — | Official | May 19, 2026 | Details |
| DeepSeek V4 Flash | DeepSeek | 95.0% | — | Official | Apr 24, 2026 | Details |
| Qwen3.7 Max Preview | Alibaba / Qwen | 94.7% | — | Official | May 14, 2026 | Details |
| Claude Opus 4.8 | Anthropic | 94.4% | — | Official | May 28, 2026 | Details |
| Mistral Medium 3.5 | Mistral AI | 94.2% | — | Official | Apr 28, 2026 | Details |
| MiMo-V2.5 | Xiaomi | 94.2% | — | Official | Apr 22, 2026 | Details |
| GPT-5.5 | OpenAI | 93.9% | — | Official | Apr 23, 2026 | Details |
| Amazon Nova 2 Pro | Amazon | 92.7% | — | Official | Dec 2, 2025 | Details |
| Muse Spark | Meta | 91.5% | — | Official | Apr 8, 2026 | Details |
| Claude Opus 4.7 | Anthropic | 88.6% | — | Official | Apr 16, 2026 | Details |
| GPT-5.4 | OpenAI | 87.1% | — | Official | Mar 5, 2026 | Details |
| MiniMax M2.1 | MiniMax | 87.0% | — | Unverified | Dec 23, 2025 | Details |
| Gemini 3 Pro | Google DeepMind | 85.4% | — | Unverified | Nov 18, 2025 | Details |
| MiniMax M2.7 | MiniMax | 84.8% | — | Official | Mar 18, 2026 | Details |
| Claude Sonnet 4.6 | Anthropic | 75.7% | — | Official | Feb 17, 2026 | Details |
| NVIDIA Nemotron 3 Super 120B-A12B | NVIDIA | 67.8% | — | Official | Mar 10, 2026 | Details |
| GPT-OSS-120B | OpenAI | 65.8% | — | Official | Aug 5, 2025 | Details |
| Kimi K2 Instruct | Moonshot AI | 65.8% | — | Unverified | Jul 11, 2025 | Details |
| o3 | OpenAI | 58.2% | — | Unverified | Apr 16, 2025 | Details |
| Claude Haiku 4.5 | Anthropic | 54.7% | — | Official | Oct 15, 2025 | Details |
| Claude 3.7 Sonnet | Anthropic | 49% | — | Verified | Feb 24, 2025 | Details |
| o4-mini | OpenAI | 42% | — | Verified | Apr 16, 2025 | Details |
| GPT-4.1 | OpenAI | 34% | — | Verified | Apr 14, 2025 | Details |
| GPT-4o | OpenAI | 23.5% | — | Unverified | May 13, 2024 | Details |
Each row reports the model’s pass^1 on τ²-bench (Telecom).
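For readers unfamiliar with the metric, here is a minimal sketch of how a pass^k score could be computed from raw trial outcomes. It assumes the definition commonly used in the τ-bench line of work: pass^k is the probability that all k independent trials of a task succeed, averaged over tasks, so pass^1 reduces to the mean single-trial success rate. The function name `pass_hat_k` and the sample data are hypothetical and are not taken from this leaderboard or from any benchmark harness.

```python
from math import comb


def pass_hat_k(successes_per_task, trials_per_task, k=1):
    """Estimate pass^k as the task-average of C(c, k) / C(n, k).

    successes_per_task: number of successful trials c for each task.
    trials_per_task: number of trials n run for every task.
    k: number of trials that must all succeed.
    """
    if not successes_per_task:
        raise ValueError("need at least one task")
    if not 1 <= k <= trials_per_task:
        raise ValueError("k must be between 1 and trials_per_task")

    total = 0.0
    for c in successes_per_task:
        if not 0 <= c <= trials_per_task:
            raise ValueError("success count out of range")
        # Fraction of k-sized subsets of the n trials that are all successes.
        total += comb(c, k) / comb(trials_per_task, k)
    return total / len(successes_per_task)


# Hypothetical outcomes: 4 tasks, 4 trials each (illustrative only).
successes = [4, 3, 4, 2]
print(f"pass^1 = {pass_hat_k(successes, 4, k=1):.4f}")  # 13/16 = 0.8125
print(f"pass^2 = {pass_hat_k(successes, 4, k=2):.4f}")  # 2/3, about 0.6667
```

With k = 1 the estimator is simply total successes divided by total trials, which is why pass^1 can be read as an ordinary success rate. Larger k values penalize models that solve a task only some of the time.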