MultiChallenge
A realistic multi-turn conversation benchmark by Scale AI (SEAL) that evaluates whether frontier LLMs can follow instructions, retain user information, perform versioned editing, and stay self-coherent across multiple conversational turns.
Category: Reasoning. Metric: accuracy (higher is better).
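As a rough illustration of how an accuracy metric over the four capabilities named above might be tallied, here is a minimal sketch. It assumes each conversation receives a single pass/fail judgment and that accuracy is the fraction of passes; the function name `accuracy_by_category`, the category labels, and the sample data are hypothetical and are not taken from the benchmark's own tooling or results.

```python
from collections import defaultdict


def accuracy_by_category(results):
    """Compute per-category and overall accuracy.

    `results` is an iterable of (category, passed) pairs, where `passed`
    is a bool. This pass/fail format is an assumption made for this sketch.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)

    if not totals:
        raise ValueError("no results to score")

    # Fraction of passed conversations within each category.
    per_category = {c: passes[c] / totals[c] for c in totals}
    # Fraction of passed conversations across all categories.
    overall = sum(passes.values()) / sum(totals.values())
    return per_category, overall


# Hypothetical judgments, invented purely to exercise the function.
example = [
    ("instruction_following", True),
    ("instruction_following", False),
    ("information_retention", True),
    ("versioned_editing", True),
    ("self_coherence", False),
]

per_category, overall = accuracy_by_category(example)
print(per_category)
print(overall)
```

Reporting the per-category breakdown alongside the overall figure makes it easier to see which of the four capabilities drives a model's score, since a single aggregate can hide a weak category.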
No run guide for this benchmark yet.