LiveBench
A frequently updated public benchmark suite spanning reasoning, coding, math, language, and instruction-following tasks.
Reasoning score (higher is better)
| Model | Lab | Score | Source model | Status | Date |
|---|---|---|---|---|---|
| GPT-5.5 | OpenAI | 80.71% | gpt-5.5-xhigh | Official | Jan 8, 2026 |
| GPT-5.4 xHigh | OpenAI | 80.28% | gpt-5.4-xhigh | Official | Jan 8, 2026 |
| Gemini 3.1 Pro Preview | Google DeepMind | 79.93% | gemini-3.1-pro-preview-high | Official | Jan 8, 2026 |
| Claude Opus 4.8 | Anthropic | 77.22% | claude-opus-4-8-xhigh-effort | Official | Jan 8, 2026 |
| Claude Opus 4.7 | Anthropic | 76.91% | claude-opus-4-7-xhigh-effort | Official | Jan 8, 2026 |
| Claude Opus 4.6 | Anthropic | 76.33% | claude-opus-4-6-thinking-auto-high-effort | Official | Jan 8, 2026 |
| Claude Opus 4.5 | Anthropic | 75.96% | claude-opus-4-5-20251101-thinking-64k-high-effort | Official | Jan 8, 2026 |
| Claude Sonnet 4.6 | Anthropic | 75.47% | claude-sonnet-4-6-thinking-auto-medium-effort | Official | Jan 8, 2026 |
| Gemini 3.5 Flash | Google DeepMind | 75.02% | gemini-3.5-flash-high | Official | Jan 8, 2026 |
| GPT-5.2 | OpenAI | 74.84% | gpt-5.2-2025-12-11-high | Official | Jan 8, 2026 |
| Qwen3.7 Max Preview | Alibaba / Qwen | 74.29% | qwen3.7-max | Official | Jan 8, 2026 |
| DeepSeek V4 Pro | DeepSeek | 73.58% | deepseek-v4-pro | Official | Jan 8, 2026 |
| Gemini 3 Pro | Google DeepMind | 73.39% | gemini-3-pro-preview-11-2025-high | Official | Jan 8, 2026 |
| Kimi K2.6 | Moonshot AI | 72.17% | kimi-k2.6-thinking | Official | Jan 8, 2026 |
| GLM-5.1 | Z.ai | 70.18% | glm-5.1 | Official | Jan 8, 2026 |
| Grok 4.20 beta reasoning | xAI | 67.96% | grok-4.20-beta-0309-reasoning | Official | Jan 8, 2026 |
Each row reports the model’s score on LiveBench.