evals.report

MultiChallenge

A realistic multi-turn conversation benchmark by Scale AI (SEAL) that evaluates whether frontier LLMs can follow instructions, retain user information, perform versioned editing, and stay self-coherent across multiple conversational turns.

Category: Reasoning | Metric: accuracy | Higher is better
| Model | Lab | Score | Status | Date |
| --- | --- | --- | --- | --- |
| Muse Spark | Meta | 75.52% | Verified | Apr 8, 2026 |
| Gemini 3.1 Pro Preview | Google DeepMind | 71.37% | Verified | Feb 19, 2026 |
| GPT-5.4 Pro | OpenAI | 69.23% | Verified | Mar 5, 2026 |
| Gemini 3 Pro | Google DeepMind | 65.67% | Verified | Nov 18, 2025 |
| GPT-5.1 | OpenAI | 63.41% | Verified | Nov 12, 2025 |
| GPT-5 | OpenAI | 63.19% | Verified | Aug 7, 2025 |
| OpenAI o3-pro | OpenAI | 62.40% | Verified | Jun 10, 2025 |
| Kimi K2.5 | Moonshot AI | 61.39% | Verified | Jan 27, 2026 |
| GPT-5 mini | OpenAI | 58.99% | Verified | Aug 7, 2025 |
| Claude Opus 4.5 | Anthropic | 58.97% | Verified | Nov 24, 2025 |
| Claude Opus 4 | Anthropic | 58.62% | Verified | May 22, 2025 |
| Claude Opus 4.1 | Anthropic | 57.20% | Verified | Aug 5, 2025 |
| Claude Sonnet 4 | Anthropic | 57.11% | Verified | May 22, 2025 |
| o3 | OpenAI | 56.62% | Verified | Apr 16, 2025 |
| Claude Opus 4.6 | Anthropic | 56.02% | Verified | Feb 5, 2026 |
| Kimi K2 Thinking | Moonshot AI | 55.42% | Verified | Nov 6, 2025 |
| Claude Sonnet 4.5 | Anthropic | 55.32% | Verified | Sep 29, 2025 |
| Gemini 2.5 Pro | Google DeepMind | 53.62% | Verified | Mar 25, 2025 |
| Claude 3.7 Sonnet | Anthropic | 51.58% | Verified | Feb 24, 2025 |
| Claude Haiku 4.5 | Anthropic | 50.49% | Verified | Oct 15, 2025 |
| DeepSeek V3.1 | DeepSeek | 46.10% | Verified | Aug 21, 2025 |
| GPT-OSS-120B | OpenAI | 45.34% | Verified | Aug 5, 2025 |
| o4-mini | OpenAI | 44.90% | Verified | Apr 16, 2025 |
| Claude 3.5 Sonnet | Anthropic | 41.4% | Verified | Jun 20, 2024 |
| GPT-4.1 | OpenAI | 39.43% | Verified | Apr 14, 2025 |
| Gemini 2.0 Flash | Google DeepMind | 36.35% | Verified | Dec 11, 2024 |

Each row reports the model’s accuracy on MultiChallenge.
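
As a small illustration of reading the leaderboard, the Python sketch below ranks a handful of rows by accuracy and picks each lab's top-scoring model. The scores are copied from the listing above; the names `ROWS`, `ranked`, and `best_per_lab` are hypothetical helpers introduced here for illustration and are not part of evals.report or MultiChallenge.

```python
# A subset of leaderboard rows: (model, lab, accuracy in percent).
# Values are copied from the listing; the structure is an assumption of this sketch.
ROWS = [
    ("Claude Opus 4", "Anthropic", 58.62),
    ("Muse Spark", "Meta", 75.52),
    ("Gemini 3 Pro", "Google DeepMind", 65.67),
    ("GPT-5.4 Pro", "OpenAI", 69.23),
    ("Claude Opus 4.5", "Anthropic", 58.97),
    ("Gemini 3.1 Pro Preview", "Google DeepMind", 71.37),
    ("Kimi K2.5", "Moonshot AI", 61.39),
    ("DeepSeek V3.1", "DeepSeek", 46.10),
]


def ranked(rows):
    """Return rows sorted by accuracy, highest first (higher is better)."""
    return sorted(rows, key=lambda row: row[2], reverse=True)


def best_per_lab(rows):
    """Return {lab: (model, score)} keeping each lab's highest-scoring model."""
    best = {}
    for model, lab, score in rows:
        if lab not in best or score > best[lab][1]:
            best[lab] = (model, score)
    return best


if __name__ == "__main__":
    for model, lab, score in ranked(ROWS):
        print(f"{score:6.2f}%  {model} ({lab})")
    for lab, (model, score) in sorted(best_per_lab(ROWS).items()):
        print(f"{lab}: {model} at {score:.2f}%")
```

Because only a subset of rows is included, the per-lab result reflects that subset rather than the full leaderboard.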