evals.report

RULER

A synthetic long-context benchmark of 13 tasks across retrieval, multi-hop tracing, aggregation, and QA that measures a model's effective context length by evaluating accuracy at increasing input lengths (4K to 128K+ tokens).
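One way to read "effective context length" from per-length accuracies is sketched below. This is an illustration under stated assumptions, not the benchmark's official scoring code: the function name `effective_context_length`, the pass threshold of 0.85, the rule that every shorter tested length must also pass, and the sample scores are all hypothetical.

```python
from typing import Mapping


def effective_context_length(
    accuracy_by_length: Mapping[int, float], threshold: float
) -> int:
    """Return the largest tested input length (in tokens) such that accuracy
    at that length and at every shorter tested length meets the threshold.

    Returns 0 if even the shortest tested length falls below the threshold.
    The contiguity requirement is an assumption made for this sketch.
    """
    effective = 0
    for length in sorted(accuracy_by_length):
        if accuracy_by_length[length] < threshold:
            break  # stop at the first length that fails
        effective = length
    return effective


# Made-up accuracies for illustration only; not results for any real model.
scores = {
    4096: 0.95,
    8192: 0.93,
    16384: 0.90,
    32768: 0.86,
    65536: 0.78,
    131072: 0.62,
}

print(effective_context_length(scores, threshold=0.85))  # prints 32768
```

With these sample numbers the model would be credited with an effective length of 32K tokens, even if it accepts 128K tokens of input, because accuracy drops below the chosen threshold at 64K.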

Category: Reasoning
Metric: accuracy
Higher is better
