evals.report

τ²-bench (Telecom)

A dual-control, multi-turn tool-agent-user benchmark (telecom split) where both the AI agent and a simulated user invoke tools to coordinate and resolve technical-support troubleshooting tasks in a shared, dynamic environment.

Category: Tool use. Metric: pass^1 (higher is better).
| Model | Lab | Score | Status | Date |
| --- | --- | --- | --- | --- |
| Claude Opus 4.6 | Anthropic | 99.3% | Unverified | Feb 5, 2026 |
| GPT-5.2 | OpenAI | 98.7% | Unverified | Dec 11, 2025 |
| Claude Opus 4.5 | Anthropic | 98.2% | Unverified | Nov 24, 2025 |
| Grok 4.3 | xAI | 97.7% | Official | Apr 17, 2026 |
| GLM-5.1 | Z.ai | 97.7% | Official | Apr 7, 2026 |
| GPT-5 | OpenAI | 96.7% | Unverified | Aug 7, 2025 |
| DeepSeek V4 Pro | DeepSeek | 96.2% | Official | Apr 24, 2026 |
| Kimi K2.6 | Moonshot AI | 95.9% | Official | Apr 20, 2026 |
| Gemini 3.1 Pro Preview | Google DeepMind | 95.6% | Official | Feb 19, 2026 |
| Qwen3.5-397B-A17B | Alibaba / Qwen | 95.6% | Official | Feb 16, 2026 |
| GPT-5.1 | OpenAI | 95.6% | Unverified | Nov 12, 2025 |
| Gemini 3.5 Flash | Google DeepMind | 95.3% | Official | May 19, 2026 |
| DeepSeek V4 Flash | DeepSeek | 95.0% | Official | Apr 24, 2026 |
| Qwen3.7 Max Preview | Alibaba / Qwen | 94.7% | Official | May 14, 2026 |
| Claude Opus 4.8 | Anthropic | 94.4% | Official | May 28, 2026 |
| Mistral Medium 3.5 | Mistral AI | 94.2% | Official | Apr 28, 2026 |
| MiMo-V2.5 | Xiaomi | 94.2% | Official | Apr 22, 2026 |
| GPT-5.5 | OpenAI | 93.9% | Official | Apr 23, 2026 |
| Amazon Nova 2 Pro | Amazon | 92.7% | Official | Dec 2, 2025 |
| Muse Spark | Meta | 91.5% | Official | Apr 8, 2026 |
| Claude Opus 4.7 | Anthropic | 88.6% | Official | Apr 16, 2026 |
| GPT-5.4 | OpenAI | 87.1% | Official | Mar 5, 2026 |
| MiniMax M2.1 | MiniMax | 87.0% | Unverified | Dec 23, 2025 |
| Gemini 3 Pro | Google DeepMind | 85.4% | Unverified | Nov 18, 2025 |
| MiniMax M2.7 | MiniMax | 84.8% | Official | Mar 18, 2026 |
| Claude Sonnet 4.6 | Anthropic | 75.7% | Official | Feb 17, 2026 |
| NVIDIA Nemotron 3 Super 120B-A12B | NVIDIA | 67.8% | Official | Mar 10, 2026 |
| GPT-OSS-120B | OpenAI | 65.8% | Official | Aug 5, 2025 |
| Kimi K2 Instruct | Moonshot AI | 65.8% | Unverified | Jul 11, 2025 |
| o3 | OpenAI | 58.2% | Unverified | Apr 16, 2025 |
| Claude Haiku 4.5 | Anthropic | 54.7% | Official | Oct 15, 2025 |
| Claude 3.7 Sonnet | Anthropic | 49% | Verified | Feb 24, 2025 |
| o4-mini | OpenAI | 42% | Verified | Apr 16, 2025 |
| GPT-4.1 | OpenAI | 34% | Verified | Apr 14, 2025 |
| GPT-4o | OpenAI | 23.5% | Unverified | May 13, 2024 |

Each row reports the model’s pass^1 on τ²-bench (Telecom).
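For readers unfamiliar with the metric, here is a minimal sketch of how a pass^k score can be estimated from per-task trial outcomes. It assumes the pass^k definition used in the original τ-bench work: the probability that k trials of the same task all succeed, averaged over tasks. The function name `pass_hat_k`, the trial count, and the sample data are hypothetical; this page does not state how many trials back each reported score or how each lab ran its evaluation.

```python
from math import comb


def pass_hat_k(successes_per_task, num_trials, k=1):
    """Estimate pass^k from per-task success counts.

    successes_per_task: number of successful trials for each task
    num_trials: trials run per task (assumed equal for every task)
    k: number of trials that must all succeed
    """
    if not 1 <= k <= num_trials:
        raise ValueError("k must be between 1 and num_trials")
    if not successes_per_task:
        raise ValueError("need at least one task")

    total = 0.0
    for c in successes_per_task:
        # Chance that k trials drawn without replacement from this
        # task's num_trials runs are all successes.
        total += comb(c, k) / comb(num_trials, k)
    return total / len(successes_per_task)


# Hypothetical outcomes: 3 tasks, 4 trials each.
successes = [4, 2, 0]
p1 = pass_hat_k(successes, num_trials=4, k=1)
p2 = pass_hat_k(successes, num_trials=4, k=2)
print(f"pass^1 = {p1:.3f}, pass^2 = {p2:.3f}")
```

With k = 1 the estimate reduces to the plain average success rate across tasks, which is why pass^1 reads like an ordinary accuracy figure. Larger k values reward consistency, since a task only counts when every sampled trial succeeds.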