evals.report

OpenAI-MRCR v2 (Multi-Round Coreference Resolution)

A long-context retrieval benchmark. The model must locate and reproduce a specific instance (the i-th 'needle') among repeated, similar requests buried in a long synthetic multi-turn conversation. Scores are reported on the 8-needle variant across context lengths up to 1M tokens.

Category: Reasoning
Metric: accuracy (mean SequenceMatcher similarity)
Higher is better
Model | Lab | Score | Source model | Status | Date
Claude Opus 4.6 | Anthropic | 93.0% | | Verified | Feb 5, 2026
Gemini 3 Pro | Google DeepMind | 77.0% | | Verified | Nov 18, 2025
GPT-5.5 | OpenAI | 74.0% | | Verified | Apr 23, 2026
GPT-5.1 | OpenAI | 61.6% | | Unverified | Nov 12, 2025
Gemini 2.5 Pro | Google DeepMind | 58.0% | | Verified | Mar 25, 2025
Claude Sonnet 4.5 | Anthropic | 47.1% | | Unverified | Sep 29, 2025
Gemini 3.5 Flash | Google DeepMind | 26.6% | | Verified | May 19, 2026
Gemini 3.1 Pro Preview | Google DeepMind | 26.3% | | Unverified | Feb 19, 2026
Gemini 3 Flash | Google DeepMind | 22.1% | | Verified | Dec 17, 2025

Each row reports the model’s accuracy (mean SequenceMatcher similarity) on OpenAI-MRCR v2 (Multi-Round Coreference Resolution).
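
To make the metric concrete, here is a minimal sketch of how a "mean SequenceMatcher similarity" score could be computed with Python's standard-library difflib. The function name mean_similarity and the sample strings are hypothetical and chosen for illustration. The official harness may apply additional checks or preprocessing that are not shown here.

```python
from difflib import SequenceMatcher


def mean_similarity(pairs):
    """Average SequenceMatcher ratio over (response, reference) pairs.

    Hypothetical helper for illustration only; it is not the benchmark's
    official grader, which may add further checks.
    """
    if not pairs:
        raise ValueError("pairs must contain at least one (response, reference) tuple")
    # ratio() returns 2*M/T, where M is the number of matched characters
    # and T is the combined length of both strings; the result is in [0, 1].
    ratios = [
        SequenceMatcher(None, response, reference).ratio()
        for response, reference in pairs
    ]
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    # Made-up sample data: one exact reproduction, one with a single wrong character.
    sample = [("abcd", "abcd"), ("abcd", "abxd")]
    print(f"{mean_similarity(sample):.1%}")
```

Because the ratio is a character-level similarity rather than an exact-match flag, a response that reproduces most of the target text still earns partial credit, so a leaderboard percentage is an average similarity and not the share of perfectly reproduced needles.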