evals.report

τ²-bench (Telecom)

Tool use · pass^1 · Higher is better

What this benchmark measures

A dual-control, multi-turn tool-agent-user benchmark (telecom split) where both the AI agent and a simulated user invoke tools to coordinate and resolve technical-support troubleshooting tasks in a shared, dynamic environment.

Rows on this page are sourced from public benchmark artifacts, leaderboard exports, or source-linked model reports. Each row keeps benchmark version, source model name, and available run details attached to the score.

The metric shown here is pass^1. It should be interpreted within τ²-bench (Telecom), not compared as part of a site-wide ranking.
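
As a rough illustration of how a pass^k score can be computed, the sketch below assumes the definition used in the τ-bench papers: for each task, pass^k is the chance that k trials drawn without replacement from the recorded trials all succeed, and the benchmark score is the mean over tasks. With k = 1 this reduces to the average per-task success rate. The function names and the shape of the `results` mapping are hypothetical and are not taken from evals.report or the benchmark's own tooling.

```python
from math import comb


def task_pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate pass^k for one task from `successes` out of `trials` runs.

    Assumed estimator: C(successes, k) / C(trials, k).
    """
    if not 1 <= k <= trials:
        raise ValueError("k must be between 1 and the number of trials")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    return comb(successes, k) / comb(trials, k)


def benchmark_pass_hat_k(results: dict[str, list[bool]], k: int) -> float:
    """Average the per-task pass^k estimate over all tasks.

    `results` maps a task id (hypothetical) to one boolean per trial.
    """
    if not results:
        raise ValueError("results must contain at least one task")
    per_task = [
        task_pass_hat_k(sum(outcomes), len(outcomes), k)
        for outcomes in results.values()
    ]
    return sum(per_task) / len(per_task)


if __name__ == "__main__":
    # Made-up outcomes for two tasks, four trials each.
    example = {
        "task_a": [True, True, True, True],
        "task_b": [True, False, True, False],
    }
    print(benchmark_pass_hat_k(example, k=1))
    print(benchmark_pass_hat_k(example, k=2))
```

Under this definition, larger k rewards consistency: a task solved in only half of its trials contributes 0.5 at k = 1 but much less at k = 2, which is one reason pass^1 scores should not be mixed with scores reported at other values of k.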

No composite ranking
evals.report never combines benchmarks. pass^1 on τ²-bench (Telecom) is its own number, so don't average it with other metrics.