RULER

Name: RULER
Creator: evals.report
License: https://creativecommons.org/licenses/by/4.0/

A synthetic long-context benchmark of 13 tasks across retrieval, multi-hop tracing, aggregation, and QA that measures a model's effective context length by evaluating accuracy at increasing input lengths (4K to 128K+ tokens).
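
One way to read "effective context length" from per-length accuracies is sketched below. This is a minimal illustration, not the benchmark's official scoring code. The function name `effective_context_length`, the pass threshold, and the sample accuracies are all hypothetical. The sketch assumes the effective length is the largest tested input length at which accuracy is still at or above the threshold, with no shorter tested length falling below it.

```python
from typing import Dict, Optional


def effective_context_length(
    scores: Dict[int, float], threshold: float
) -> Optional[int]:
    """Return the largest tested length whose accuracy, and the accuracy at
    every shorter tested length, is at or above the threshold.

    Returns None if even the shortest tested length is below the threshold.
    """
    effective = None
    for length in sorted(scores):
        if scores[length] < threshold:
            break  # first failure ends the contiguous passing range
        effective = length
    return effective


# Made-up accuracies keyed by input length in tokens (illustration only).
example_scores = {
    4_000: 0.95,
    8_000: 0.93,
    16_000: 0.90,
    32_000: 0.84,
    64_000: 0.70,
    128_000: 0.55,
}
example_threshold = 0.85  # hypothetical pass bar, not an official value

print(effective_context_length(example_scores, example_threshold))  # 16000
```

With these made-up numbers the model passes at 4K, 8K, and 16K tokens and first drops below the bar at 32K, so the sketch reports 16,000 tokens even if the model accepts much longer inputs.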

Reasoning | accuracy | Higher is better

