evals.report

IFBench

IFBench is Ai2's instruction-following benchmark. It measures precise instruction-following generalization on 58 diverse, verifiable, out-of-domain output constraints, which are designed to test whether models can obey novel rules rather than overfit to familiar constraint templates.
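To make "verifiable output constraint" concrete, here is a minimal Python sketch of how a programmatic check and an accuracy score could fit together. The two constraints (`word_count_between`, `no_word_repeated`), the sample responses, and the simple pass-fraction scoring are all hypothetical illustrations; they are not taken from IFBench, and this page does not say how the benchmark aggregates its results.

```python
import re
from typing import Callable, List, Tuple

Check = Callable[[str], bool]


def word_count_between(lo: int, hi: int) -> Check:
    """Hypothetical constraint: response length must be lo..hi words, inclusive."""
    def check(response: str) -> bool:
        return lo <= len(response.split()) <= hi
    return check


def no_word_repeated(response: str) -> bool:
    """Hypothetical constraint: no word may appear twice (case-insensitive)."""
    words = re.findall(r"[a-z']+", response.lower())
    return len(words) == len(set(words))


def accuracy(cases: List[Tuple[str, Check]]) -> float:
    """Fraction of responses that satisfy their constraint (assumed scoring rule)."""
    if not cases:
        raise ValueError("need at least one case")
    passed = sum(1 for response, check in cases if check(response))
    return passed / len(cases)


# Made-up (response, constraint) pairs, for illustration only.
cases = [
    ("Blue skies help me relax.", word_count_between(3, 6)),  # 5 words: pass
    ("The cat saw the dog.", no_word_repeated),               # "the" repeats: fail
    ("Rain falls softly tonight.", no_word_repeated),         # all unique: pass
    ("Too short.", word_count_between(5, 10)),                # 2 words: fail
]

score = accuracy(cases)
print(f"{score:.1%}")  # 50.0%
```

Because each check is a plain function of the response text, a result can be verified without a human or model judge, which is the property the word "verifiable" points to.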

Reasoning · accuracy · Higher is better
| Model | Lab | Score | Source model | Status | Date |
| --- | --- | --- | --- | --- | --- |
| Grok 4.3 | xAI | 83.3% |  | Official | Apr 17, 2026 |
| Qwen3.7 Max Preview | Alibaba / Qwen | 80.5% |  | Official | May 14, 2026 |
| MiMo-V2.5-Pro | Xiaomi | 79.9% |  | Official | Apr 22, 2026 |
| DeepSeek V4 Flash | DeepSeek | 79.2% |  | Official | Apr 24, 2026 |
| Amazon Nova 2 Pro | Amazon | 79.0% |  | Official | Dec 2, 2025 |
| Qwen3.5-397B-A17B | Alibaba / Qwen | 78.8% |  | Official | Feb 16, 2026 |
| Gemini 3.1 Pro Preview | Google DeepMind | 77.1% |  | Official | Feb 19, 2026 |
| DeepSeek V4 Pro | DeepSeek | 76.5% |  | Official | Apr 24, 2026 |
| Gemini 3.5 Flash | Google DeepMind | 76.3% |  | Official | May 19, 2026 |
| GLM-5.1 | Z.ai | 76.3% |  | Official | Apr 7, 2026 |
| Kimi K2.6 | Moonshot AI | 76.0% |  | Official | Apr 20, 2026 |
| Muse Spark | Meta | 75.9% |  | Official | Apr 8, 2026 |
| GPT-5.5 | OpenAI | 75.9% |  | Official | Apr 23, 2026 |
| MiniMax M2.7 | MiniMax | 75.7% |  | Official | Mar 18, 2026 |
| GPT-5.4 | OpenAI | 73.9% |  | Official | Mar 5, 2026 |
| NVIDIA Nemotron 3 Super 120B-A12B | NVIDIA | 71.5% |  | Official | Mar 10, 2026 |
| o3 | OpenAI | 69.3% |  | Official | Apr 16, 2025 |
| GPT-OSS-120B | OpenAI | 69.0% |  | Official | Aug 5, 2025 |
| Mistral Medium 3.5 | Mistral AI | 68.8% |  | Official | Apr 28, 2026 |
| Claude Opus 4.8 | Anthropic | 62.2% |  | Official | May 28, 2026 |
| Claude Opus 4.7 | Anthropic | 58.6% |  | Official | Apr 16, 2026 |
| Claude Sonnet 4.6 | Anthropic | 56.6% |  | Official | Feb 17, 2026 |
| Claude Haiku 4.5 | Anthropic | 54.3% |  | Official | Oct 15, 2025 |
| Gemini 2.5 Pro | Google DeepMind | 52.3% |  | Official | Mar 25, 2025 |
| Claude Sonnet 4 | Anthropic | 42.3% |  | Official | May 22, 2025 |
| DeepSeek R1 | DeepSeek | 38.0% |  | Official | Jan 20, 2025 |

Each row reports the model’s accuracy on IFBench.