evals.report

MultiChallenge

A realistic multi-turn conversation benchmark by Scale AI (SEAL) that evaluates whether frontier LLMs can follow instructions, retain user information, perform versioned editing, and stay self-coherent across multiple conversational turns.

Reasoning · accuracy · Higher is better

No run guide for this benchmark yet.