evals.report

LongBench v2

A long-context benchmark of 503 challenging multiple-choice questions with contexts from 8k to 2M words across six task categories, designed to test deep understanding and reasoning over realistic long-context multitasks.

Category: Reasoning. Metric: accuracy (higher is better).

| Model | Lab | Score | Source model | Status | Date |
| Gemini 2.5 Pro | Google DeepMind | 63.3% | | Official | Mar 25, 2025 |
| Gemini 2.5 Flash | Google DeepMind | 62.1% | | Official | Apr 17, 2025 |
| Qwen3 235B A22B Instruct 2507 | Alibaba / Qwen | 58.3% | | Official | Jul 21, 2025 |
| DeepSeek R1 | DeepSeek | 58.3% | | Official | Jan 20, 2025 |
| GPT-4o | OpenAI | 51.4% | | Official | May 13, 2024 |
| Gemini 2.0 Flash | Google DeepMind | 51.1% | | Official | Dec 11, 2024 |
| Claude 3.5 Sonnet | Anthropic | 46.7% | | Official | Jun 20, 2024 |
| Kimi K2 Instruct | Moonshot AI | 44.3% | | Official | Jul 11, 2025 |
| Mistral Large | Mistral AI | 39.6% | | Official | Feb 26, 2024 |

Each row reports the model’s accuracy on LongBench v2.
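
As a rough illustration of the metric, accuracy on a multiple-choice benchmark is the share of questions where the predicted choice matches the reference answer. The sketch below assumes exact-match scoring on answer letters; the function name `accuracy_percent` and the toy data are hypothetical and are not taken from the LongBench v2 evaluation code.

```python
def accuracy_percent(predictions, answers):
    """Return the share of exactly matching choices, as a percentage."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    # Count the questions where the predicted letter equals the reference.
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


# Hypothetical toy data: four questions with answer letters A-D.
answers = ["A", "C", "B", "D"]
predictions = ["A", "C", "D", "D"]
score = accuracy_percent(predictions, answers)
print(f"{score:.1f}%")  # prints "75.0%"
```

With 503 questions, a single question is worth about 0.2 percentage points, so small gaps between adjacent rows correspond to only a handful of questions.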