evals.report

τ²-bench (Telecom)

A dual-control, multi-turn tool-agent-user benchmark (telecom split) where both the AI agent and a simulated user invoke tools to coordinate and resolve technical-support troubleshooting tasks in a shared, dynamic environment.

Tool use · pass^1 · Higher is better
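
For readers unfamiliar with the pass^1 label, the following is a minimal sketch of how a pass^k score is commonly estimated, assuming the definition used by the τ-bench family: the probability that k independent trials of the same task all succeed, averaged over tasks. The function name `pass_hat_k` and the input layout are illustrative assumptions, not part of this page or of any official harness. With k = 1 the estimate reduces to the plain average success rate.

```python
from math import comb


def pass_hat_k(successes_per_task, num_trials, k):
    """Estimate pass^k from per-task success counts.

    successes_per_task: list of ints, successes out of num_trials for each task
    num_trials: trials run per task (assumed equal for every task)
    k: number of trials that must all succeed
    """
    if not 1 <= k <= num_trials:
        raise ValueError("k must be between 1 and num_trials")
    if not successes_per_task:
        raise ValueError("need at least one task")
    # For each task, C(c, k) / C(n, k) is the chance that k trials drawn
    # without replacement from the n observed trials are all successes.
    per_task = [comb(c, k) / comb(num_trials, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)


# Hypothetical counts: three tasks, four trials each.
example_counts = [4, 2, 0]
score_k1 = pass_hat_k(example_counts, num_trials=4, k=1)
score_k2 = pass_hat_k(example_counts, num_trials=4, k=2)
```

Under these assumed counts, `score_k1` is the mean per-task success rate, while `score_k2` is lower because it rewards only tasks that succeed consistently across trials.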

No run guide for this benchmark yet.