LongBench v2
A long-context benchmark of 503 challenging multiple-choice questions, with contexts ranging from 8k to 2M words across six task categories, designed to test deep understanding and reasoning over realistic long-context tasks.
Category: Reasoning. Metric: accuracy (higher is better).
| Model | Lab | Score | Source model | Status | Date |
|---|---|---|---|---|---|
| Gemini 2.5 Pro | Google DeepMind | 63.3% | — | Official | Mar 25, 2025 |
| Gemini 2.5 Flash | Google DeepMind | 62.1% | — | Official | Apr 17, 2025 |
| Qwen3 235B A22B Instruct 2507 | Alibaba / Qwen | 58.3% | — | Official | Jul 21, 2025 |
| DeepSeek R1 | DeepSeek | 58.3% | — | Official | Jan 20, 2025 |
| GPT-4o | OpenAI | 51.4% | — | Official | May 13, 2024 |
| Gemini 2.0 Flash | Google DeepMind | 51.1% | — | Official | Dec 11, 2024 |
| Claude 3.5 Sonnet | Anthropic | 46.7% | — | Official | Jun 20, 2024 |
| Kimi K2 Instruct | Moonshot AI | 44.3% | — | Official | Jul 11, 2025 |
| Mistral Large | Mistral AI | 39.6% | — | Official | Feb 26, 2024 |
Each row reports the model’s accuracy on LongBench v2.
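For readers unfamiliar with the metric, accuracy on a multiple-choice benchmark is the share of questions answered correctly, expressed as a percentage. The snippet below is a minimal sketch of that calculation, not the benchmark's official scoring script; the function name `accuracy` and the toy answer lists are hypothetical and exist only for illustration.

```python
def accuracy(predictions, answers):
    """Return the percentage of predictions that match the gold answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("no questions to score")
    # Count exact matches between predicted and gold option letters.
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


# Hypothetical toy data: five questions, three answered correctly.
gold = ["A", "C", "B", "D", "A"]
pred = ["A", "C", "D", "D", "B"]
print(f"{accuracy(pred, gold):.1f}%")  # prints "60.0%"
```

On the full benchmark the same ratio would be taken over all 503 questions, assuming every question is scored and weighted equally.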