RULER

Name: RULER
Creator: evals.report
License: https://creativecommons.org/licenses/by/4.0/

A synthetic long-context benchmark of 13 tasks across retrieval, multi-hop tracing, aggregation, and QA that measures a model's effective context length by evaluating accuracy at increasing input lengths (4K to 128K+ tokens).
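One way to read "effective context length" from per-length accuracies is to take the longest evaluated input length at which accuracy stays at or above a chosen threshold. The following is a minimal sketch under that assumption. The function name `effective_context_length`, the threshold value, and all scores are hypothetical and are not taken from this page or from any RULER run.

```python
from typing import Mapping, Optional


def effective_context_length(
    accuracy_by_length: Mapping[int, float],
    threshold: float,
) -> Optional[int]:
    """Return the largest evaluated length whose accuracy, and the accuracy
    at every shorter evaluated length, is at or above the threshold.

    Returns None if the model misses the threshold at the shortest length.
    """
    effective = None
    for length in sorted(accuracy_by_length):
        if accuracy_by_length[length] < threshold:
            break  # accuracy dropped below the bar; stop extending
        effective = length
    return effective


# Hypothetical accuracies at increasing input lengths (in tokens).
scores = {
    4096: 0.95,
    8192: 0.93,
    16384: 0.90,
    32768: 0.84,
    65536: 0.70,
    131072: 0.55,
}

# The 0.85 threshold is an illustrative assumption.
print(effective_context_length(scores, threshold=0.85))
```

With these made-up numbers, accuracy first falls below the threshold at 32768 tokens, so the sketch reports 16384 as the effective length even though the model accepts longer inputs.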

Reasoning | accuracy | Higher is better

Model	Lab	Score	Source model	Status	Date
Gemini 1.5 Pro	Google DeepMind	94.4%	-	Official	Feb 15, 2024
Mistral Large	Mistral AI	48.1%	-	Official	Feb 26, 2024

Each row reports the model’s accuracy on RULER.