IFBench
IFBench is Ai2's instruction-following benchmark. It measures how well precise instruction following generalizes, using 58 diverse, verifiable, out-of-domain output constraints designed to test whether models can obey novel rules rather than overfit to familiar constraint templates.
Reasoning · accuracy · higher is better
| Model | Lab | Score ↓ | Source model | Status | Date | Details |
|---|---|---|---|---|---|---|
| Grok 4.3 | xAI | 83.3% | — | Official | Apr 17, 2026 | Details |
| Qwen3.7 Max Preview | Alibaba / Qwen | 80.5% | — | Official | May 14, 2026 | Details |
| MiMo-V2.5-Pro | Xiaomi | 79.9% | — | Official | Apr 22, 2026 | Details |
| DeepSeek V4 Flash | DeepSeek | 79.2% | — | Official | Apr 24, 2026 | Details |
| Amazon Nova 2 Pro | Amazon | 79.0% | — | Official | Dec 2, 2025 | Details |
| Qwen3.5-397B-A17B | Alibaba / Qwen | 78.8% | — | Official | Feb 16, 2026 | Details |
| Gemini 3.1 Pro Preview | Google DeepMind | 77.1% | — | Official | Feb 19, 2026 | Details |
| DeepSeek V4 Pro | DeepSeek | 76.5% | — | Official | Apr 24, 2026 | Details |
| Gemini 3.5 Flash | Google DeepMind | 76.3% | — | Official | May 19, 2026 | Details |
| GLM-5.1 | Z.ai | 76.3% | — | Official | Apr 7, 2026 | Details |
| Kimi K2.6 | Moonshot AI | 76.0% | — | Official | Apr 20, 2026 | Details |
| Muse Spark | Meta | 75.9% | — | Official | Apr 8, 2026 | Details |
| GPT-5.5 | OpenAI | 75.9% | — | Official | Apr 23, 2026 | Details |
| MiniMax M2.7 | MiniMax | 75.7% | — | Official | Mar 18, 2026 | Details |
| GPT-5.4 | OpenAI | 73.9% | — | Official | Mar 5, 2026 | Details |
| NVIDIA Nemotron 3 Super 120B-A12B | NVIDIA | 71.5% | — | Official | Mar 10, 2026 | Details |
| o3 | OpenAI | 69.3% | — | Official | Apr 16, 2025 | Details |
| GPT-OSS-120B | OpenAI | 69.0% | — | Official | Aug 5, 2025 | Details |
| Mistral Medium 3.5 | Mistral AI | 68.8% | — | Official | Apr 28, 2026 | Details |
| Claude Opus 4.8 | Anthropic | 62.2% | — | Official | May 28, 2026 | Details |
| Claude Opus 4.7 | Anthropic | 58.6% | — | Official | Apr 16, 2026 | Details |
| Claude Sonnet 4.6 | Anthropic | 56.6% | — | Official | Feb 17, 2026 | Details |
| Claude Haiku 4.5 | Anthropic | 54.3% | — | Official | Oct 15, 2025 | Details |
| Gemini 2.5 Pro | Google DeepMind | 52.3% | — | Official | Mar 25, 2025 | Details |
| Claude Sonnet 4 | Anthropic | 42.3% | — | Official | May 22, 2025 | Details |
| DeepSeek R1 | DeepSeek | 38.0% | — | Official | Jan 20, 2025 | Details |
Each row reports the model’s accuracy on IFBench.
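
To make "verifiable constraint" and "accuracy" concrete, here is a minimal sketch of how such a score could be computed. It assumes accuracy is the share of responses that pass every check attached to their prompt. The constraint `has_n_question_marks` and the helper `accuracy` are hypothetical names invented for illustration; they are not taken from IFBench, whose actual constraints and scoring rules may differ.

```python
from typing import Callable, List

Check = Callable[[str], bool]


def has_n_question_marks(n: int) -> Check:
    """Hypothetical constraint (not from IFBench): exactly n question marks."""
    def check(response: str) -> bool:
        return response.count("?") == n
    return check


def accuracy(responses: List[str], checks: List[List[Check]]) -> float:
    """Fraction of responses that satisfy all checks attached to their prompt.

    Assumption: a response counts as correct only if every check passes.
    """
    if not responses:
        return 0.0
    passed = sum(
        all(check(response) for check in prompt_checks)
        for response, prompt_checks in zip(responses, checks)
    )
    return passed / len(responses)


if __name__ == "__main__":
    responses = ["Why? How?", "No questions here."]
    checks = [[has_n_question_marks(2)], [has_n_question_marks(1)]]
    print(f"{accuracy(responses, checks):.1%}")
```

Because each check is a plain function from a response string to a boolean, the result can be verified without a human or model judge, which is what makes a constraint of this kind "verifiable".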