MultiChallenge
A realistic multi-turn conversation benchmark by Scale AI (SEAL) that evaluates whether frontier LLMs can follow instructions, retain user information, perform versioned editing, and stay self-coherent across multiple conversational turns.
Category: Reasoning · Metric: accuracy · Higher is better
| Model | Lab | Score (accuracy) | Source model | Status | Date |
|---|---|---|---|---|---|
| Muse Spark | Meta | 75.52% | — | Verified | Apr 8, 2026 |
| Gemini 3.1 Pro Preview | Google DeepMind | 71.37% | — | Verified | Feb 19, 2026 |
| GPT-5.4 Pro | OpenAI | 69.23% | — | Verified | Mar 5, 2026 |
| Gemini 3 Pro | Google DeepMind | 65.67% | — | Verified | Nov 18, 2025 |
| GPT-5.1 | OpenAI | 63.41% | — | Verified | Nov 12, 2025 |
| GPT-5 | OpenAI | 63.19% | — | Verified | Aug 7, 2025 |
| OpenAI o3-pro | OpenAI | 62.40% | — | Verified | Jun 10, 2025 |
| Kimi K2.5 | Moonshot AI | 61.39% | — | Verified | Jan 27, 2026 |
| GPT-5 mini | OpenAI | 58.99% | — | Verified | Aug 7, 2025 |
| Claude Opus 4.5 | Anthropic | 58.97% | — | Verified | Nov 24, 2025 |
| Claude Opus 4 | Anthropic | 58.62% | — | Verified | May 22, 2025 |
| Claude Opus 4.1 | Anthropic | 57.20% | — | Verified | Aug 5, 2025 |
| Claude Sonnet 4 | Anthropic | 57.11% | — | Verified | May 22, 2025 |
| o3 | OpenAI | 56.62% | — | Verified | Apr 16, 2025 |
| Claude Opus 4.6 | Anthropic | 56.02% | — | Verified | Feb 5, 2026 |
| Kimi K2 Thinking | Moonshot AI | 55.42% | — | Verified | Nov 6, 2025 |
| Claude Sonnet 4.5 | Anthropic | 55.32% | — | Verified | Sep 29, 2025 |
| Gemini 2.5 Pro | Google DeepMind | 53.62% | — | Verified | Mar 25, 2025 |
| Claude 3.7 Sonnet | Anthropic | 51.58% | — | Verified | Feb 24, 2025 |
| Claude Haiku 4.5 | Anthropic | 50.49% | — | Verified | Oct 15, 2025 |
| DeepSeek V3.1 | DeepSeek | 46.10% | — | Verified | Aug 21, 2025 |
| GPT-OSS-120B | OpenAI | 45.34% | — | Verified | Aug 5, 2025 |
| o4-mini | OpenAI | 44.90% | — | Verified | Apr 16, 2025 |
| Claude 3.5 Sonnet | Anthropic | 41.4% | — | Verified | Jun 20, 2024 |
| GPT-4.1 | OpenAI | 39.43% | — | Verified | Apr 14, 2025 |
| Gemini 2.0 Flash | Google DeepMind | 36.35% | — | Verified | Dec 11, 2024 |
Each row reports the model’s accuracy on MultiChallenge.