evals.report

RULER

A synthetic long-context benchmark of 13 tasks across retrieval, multi-hop tracing, aggregation, and QA that measures a model's effective context length by evaluating accuracy at increasing input lengths (4K to 128K+ tokens).
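As a rough illustration of how an effective context length could be read off per-length accuracies, the sketch below returns the largest tested input length at which accuracy still meets a threshold, with every shorter tested length also passing. The function name, the threshold value, and the sample scores are all hypothetical; they are not taken from this page or from any reported run, and the "all shorter lengths must pass" rule is one assumed reading of the definition.

```python
from typing import Mapping, Optional


def effective_context_length(
    accuracy_by_length: Mapping[int, float],
    threshold: float,
) -> Optional[int]:
    """Return the largest tested length whose accuracy, and that of every
    shorter tested length, is at or above the threshold.

    Returns None if even the shortest tested length falls below it.
    """
    effective = None
    # Walk lengths from shortest to longest and stop at the first failure.
    for length in sorted(accuracy_by_length):
        if accuracy_by_length[length] < threshold:
            break
        effective = length
    return effective


# Hypothetical accuracies keyed by input length in tokens (illustrative only).
scores = {
    4_000: 0.96,
    8_000: 0.94,
    16_000: 0.91,
    32_000: 0.86,
    64_000: 0.78,
    128_000: 0.61,
}

# The 0.85 threshold is an assumption chosen for this example.
print(effective_context_length(scores, threshold=0.85))  # prints 32000
```

With these made-up numbers, accuracy first drops below the threshold at 64K tokens, so the effective length is the last passing length, 32K, even though the input window extends to 128K.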

Reasoning · accuracy · Higher is better
| Model | Lab | Score | Source model | Status | Date |
|---|---|---|---|---|---|
| Gemini 1.5 Pro | Google DeepMind | 94.4% | | Official | Feb 15, 2024 |
| Mistral Large | Mistral AI | 48.1% | | Official | Feb 26, 2024 |

Each row reports the model’s accuracy on RULER.