LongBench v2
LongBench v2 is a long-context benchmark of 503 challenging multiple-choice questions, with contexts ranging from 8k to 2M words across six task categories. It is designed to test deep understanding and reasoning on realistic long-context tasks.
What this benchmark measures
Rows on this page are sourced from public benchmark artifacts, leaderboard exports, or source-linked model reports. Each row keeps the benchmark version, source model name, and any available run details attached to its score.
The metric shown here is accuracy: the share of questions answered correctly. Scores should be interpreted within LongBench v2 only, not compared across benchmarks as part of a site-wide ranking.
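As a rough illustration of how an accuracy score on multiple-choice questions can be computed, here is a minimal Python sketch. The function name `accuracy`, the case-insensitive letter matching, and the toy data are assumptions made for this example; this is not the benchmark's official scoring script, and the toy values are not real results.

```python
def accuracy(predictions, answers):
    """Return the percentage of questions answered correctly.

    Hypothetical helper: compares predicted choice labels with gold
    labels, ignoring case and surrounding whitespace.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return 100.0 * correct / len(answers)


# Toy data for illustration only, not real benchmark output.
toy_predictions = ["A", "c", "B", "D"]
toy_answers = ["A", "C", "D", "D"]
toy_score = accuracy(toy_predictions, toy_answers)
print(f"{toy_score:.1f}")  # 75.0
```

Because the score is a fraction of this benchmark's own question set, the same number on a different benchmark reflects a different set of questions and difficulty, which is why scores are not pooled into a site-wide ranking.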