OpenAI-MRCR v2 (Multi-Round Coreference Resolution)
A long-context retrieval benchmark in which a model must locate and reproduce a specific instance (the i-th 'needle') of repeated similar requests buried in a long synthetic multi-turn conversation. Scores are reported on the 8-needle variant across context lengths up to 1M tokens.
Reasoning · accuracy (mean SequenceMatcher similarity) · Higher is better
| Model | Lab | Score | Source model | Status | Date |
|---|---|---|---|---|---|
| Claude Opus 4.6 | Anthropic | 93.0% | — | Verified | Feb 5, 2026 |
| Gemini 3 Pro | Google DeepMind | 77.0% | — | Verified | Nov 18, 2025 |
| GPT-5.5 | OpenAI | 74.0% | — | Verified | Apr 23, 2026 |
| GPT-5.1 | OpenAI | 61.6% | — | Unverified | Nov 12, 2025 |
| Gemini 2.5 Pro | Google DeepMind | 58.0% | — | Verified | Mar 25, 2025 |
| Claude Sonnet 4.5 | Anthropic | 47.1% | — | Unverified | Sep 29, 2025 |
| Gemini 3.5 Flash | Google DeepMind | 26.6% | — | Verified | May 19, 2026 |
| Gemini 3.1 Pro Preview | Google DeepMind | 26.3% | — | Unverified | Feb 19, 2026 |
| Gemini 3 Flash | Google DeepMind | 22.1% | — | Verified | Dec 17, 2025 |
Each row reports the model’s accuracy (mean SequenceMatcher similarity) on OpenAI-MRCR v2 (Multi-Round Coreference Resolution).
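To make the metric concrete, here is a minimal sketch of how a mean SequenceMatcher similarity could be computed with Python's standard `difflib` module. It is not the benchmark's official grader: the function names `similarity` and `mean_score` are hypothetical, and any additional grading rules the benchmark applies (for example, requirements on how a response must begin) are not described on this page and are not modeled here.

```python
from difflib import SequenceMatcher


def similarity(response: str, reference: str) -> float:
    """Character-level similarity in [0, 1] between a model response and the reference.

    Uses difflib.SequenceMatcher.ratio(), i.e. 2 * matches / total length.
    Hypothetical helper; the official grader may add further rules.
    """
    return SequenceMatcher(None, response, reference).ratio()


def mean_score(pairs: list[tuple[str, str]]) -> float:
    """Average similarity over (response, reference) pairs; 0.0 for an empty list."""
    if not pairs:
        return 0.0
    return sum(similarity(resp, ref) for resp, ref in pairs) / len(pairs)


if __name__ == "__main__":
    # Illustrative strings only, not benchmark data.
    samples = [
        ("abc", "abc"),    # exact reproduction: ratio 1.0
        ("abcd", "abxx"),  # shares only "ab": ratio 2 * 2 / 8 = 0.5
    ]
    print(f"{mean_score(samples):.1%}")  # prints 75.0%
```

Under this reading, a score such as 93.0% would mean that responses matched their reference texts with an average similarity ratio of 0.93, not that 93% of responses were exact matches.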