RULER
A synthetic long-context benchmark of 13 tasks across retrieval, multi-hop tracing, aggregation, and QA that measures a model's effective context length by evaluating accuracy at increasing input lengths (4K to 128K+ tokens).
Category: Reasoning. Metric: accuracy. Higher is better.
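As an illustration of how an effective context length might be read off per-length accuracy, the minimal sketch below returns the longest evaluated input length at which accuracy still meets a pass threshold. The function name `effective_context_length`, the threshold value, the rule that every shorter length must also pass, and the scores themselves are assumptions made up for this example. They are not taken from the benchmark's own tooling or results.

```python
from typing import Mapping, Optional


def effective_context_length(
    accuracy_by_length: Mapping[int, float],
    threshold: float,
) -> Optional[int]:
    """Return the longest evaluated length whose accuracy meets the threshold.

    Assumption for this sketch: lengths are checked from shortest to longest
    and the scan stops at the first length that falls below the threshold,
    so a later recovery at a longer length does not count.
    Returns None if even the shortest length fails.
    """
    effective = None
    for length in sorted(accuracy_by_length):
        if accuracy_by_length[length] < threshold:
            break
        effective = length
    return effective


# Hypothetical accuracies at increasing input lengths (tokens); not real results.
scores = {
    4096: 0.95,
    8192: 0.93,
    16384: 0.90,
    32768: 0.84,
    65536: 0.70,
    131072: 0.55,
}

# The 0.85 threshold is an illustrative choice, not an official value.
print(effective_context_length(scores, threshold=0.85))  # prints 16384
```

With these made-up numbers, a model advertised for 128K tokens would have an effective length of only 16K, which is the kind of gap between claimed and usable context that the benchmark is designed to expose.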
No run guide for this benchmark yet.