evals.report

OpenAI-MRCR v2 (Multi-Round Coreference Resolution)

Category: Reasoning | Metric: accuracy (mean SequenceMatcher similarity) | Higher is better

What this benchmark measures

OpenAI-MRCR v2 is a long-context retrieval benchmark. The model receives a long synthetic multi-turn conversation in which similar requests are repeated, and it must locate and reproduce one specific instance of those requests (the i-th 'needle'). Scores are reported on the 8-needle variant, across context lengths up to 1M tokens.

Rows on this page are sourced from public benchmark artifacts, leaderboard exports, or source-linked model reports. Each row keeps benchmark version, source model name, and available run details attached to the score.

The metric shown here is accuracy (mean SequenceMatcher similarity). It should be interpreted within OpenAI-MRCR v2 (Multi-Round Coreference Resolution), not compared as part of a site-wide ranking.
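As a rough illustration of what "mean SequenceMatcher similarity" means, the sketch below averages the ratio from Python's standard `difflib.SequenceMatcher` over (response, reference) pairs. The function names `similarity` and `mean_similarity` are hypothetical, and the benchmark's published grader may apply conditions this sketch leaves out, so treat it as an approximation of the metric rather than the official scoring code.

```python
from difflib import SequenceMatcher
from statistics import mean


def similarity(response: str, reference: str) -> float:
    """Return the SequenceMatcher ratio (0.0 to 1.0) between two strings."""
    return SequenceMatcher(None, response, reference).ratio()


def mean_similarity(pairs: list[tuple[str, str]]) -> float:
    """Average the per-sample ratio over (response, reference) pairs.

    Assumption: every sample is weighted equally; the real harness may
    bucket samples by context length or apply extra checks before scoring.
    """
    if not pairs:
        raise ValueError("need at least one (response, reference) pair")
    return mean(similarity(resp, ref) for resp, ref in pairs)


# Usage: one exact reproduction and one near miss.
samples = [("abcd", "abcd"), ("abcd", "abce")]
print(mean_similarity(samples))  # 0.875
```

Because the ratio rewards partial overlap, a response that reproduces most of the target text still earns partial credit, which is why the score is a similarity average rather than an exact-match rate.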

No composite ranking
evals.report never combines benchmarks. The accuracy (mean SequenceMatcher similarity) score on OpenAI-MRCR v2 (Multi-Round Coreference Resolution) is its own number; don't average it with other metrics.