evals.report

LongBench v2

A long-context benchmark of 503 challenging multiple-choice questions, with contexts ranging from 8k to 2M words across six task categories. It is designed to test deep understanding and reasoning on realistic long-context tasks.

Category: Reasoning · Metric: accuracy (higher is better)
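Because the benchmark is scored by accuracy over multiple-choice questions, the metric can be sketched as the fraction of questions where the predicted option matches the gold option. The snippet below is a minimal illustration only: the function name `accuracy`, the letter-style labels, and the case-insensitive comparison are assumptions made for this example, not the benchmark's official scoring code.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Return the fraction of predictions that match the gold answers.

    Hypothetical helper: labels are compared case-insensitively after
    trimming whitespace, which is an assumption of this sketch.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


if __name__ == "__main__":
    # Made-up labels for illustration; these are not benchmark results.
    preds = ["A", "b", "C", "D"]
    gold = ["A", "B", "D", "D"]
    print(f"accuracy = {accuracy(preds, gold):.2f}")
```

A higher value means more questions answered correctly, which matches the "higher is better" direction stated for this benchmark.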

No run guide for this benchmark yet.