RULER
A synthetic long-context benchmark of 13 tasks across retrieval, multi-hop tracing, aggregation, and QA that measures a model's effective context length by evaluating accuracy at increasing input lengths (4K to 128K+ tokens).
Reasoning · accuracy · Higher is better
| Model | Lab | Score | Source model | Status | Date |
|---|---|---|---|---|---|
| Gemini 1.5 Pro | Google DeepMind | 94.4% | — | Official | Feb 15, 2024 |
| Mistral Large | Mistral AI | 48.1% | — | Official | Feb 26, 2024 |
Each row reports the model’s accuracy on RULER.
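
The benchmark description above speaks of measuring a model's effective context length by evaluating accuracy at increasing input lengths. A minimal sketch of how such a figure could be derived from per-length accuracies follows. It assumes that "effective" means the longest tested length up to which every tested length stays at or above a chosen accuracy threshold. The function name, the threshold value, and all numbers are hypothetical illustrations, not results from this leaderboard or the benchmark's official scoring code.

```python
from typing import Mapping, Optional


def effective_context_length(
    accuracy_by_length: Mapping[int, float],
    threshold: float,
) -> Optional[int]:
    """Return the longest tested input length that still passes.

    Assumption (not taken from the leaderboard): a length "passes" when
    accuracy at that length and at every shorter tested length is at or
    above `threshold`. Returns None if even the shortest length fails.
    """
    effective = None
    # Walk the tested lengths from shortest to longest.
    for length in sorted(accuracy_by_length):
        if accuracy_by_length[length] < threshold:
            break  # First failure ends the passing prefix.
        effective = length
    return effective


# Hypothetical per-length accuracies for an imaginary model.
hypothetical_scores = {
    4_000: 0.95,
    8_000: 0.93,
    16_000: 0.90,
    32_000: 0.84,
    64_000: 0.70,
    128_000: 0.55,
}

# Hypothetical threshold of 0.85.
print(effective_context_length(hypothetical_scores, threshold=0.85))  # prints 16000
```

Stopping at the first failing length is a design choice in this sketch: a model that recovers at a longer length after failing at a shorter one is not credited with the longer length.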