evals.report

LiveBench

A frequently updated public benchmark suite spanning reasoning, coding, math, language, and instruction-following tasks.

Reasoning score (higher is better)

| Model | Lab | Score | Source model | Status | Date |
|---|---|---|---|---|---|
| GPT-5.5 | OpenAI | 80.71% | gpt-5.5-xhigh | Official | Jan 8, 2026 |
| GPT-5.4 xHigh | OpenAI | 80.28% | gpt-5.4-xhigh | Official | Jan 8, 2026 |
| Gemini 3.1 Pro Preview | Google DeepMind | 79.93% | gemini-3.1-pro-preview-high | Official | Jan 8, 2026 |
| Claude Opus 4.8 | Anthropic | 77.22% | claude-opus-4-8-xhigh-effort | Official | Jan 8, 2026 |
| Claude Opus 4.7 | Anthropic | 76.91% | claude-opus-4-7-xhigh-effort | Official | Jan 8, 2026 |
| Claude Opus 4.6 | Anthropic | 76.33% | claude-opus-4-6-thinking-auto-high-effort | Official | Jan 8, 2026 |
| Claude Opus 4.5 | Anthropic | 75.96% | claude-opus-4-5-20251101-thinking-64k-high-effort | Official | Jan 8, 2026 |
| Claude Sonnet 4.6 | Anthropic | 75.47% | claude-sonnet-4-6-thinking-auto-medium-effort | Official | Jan 8, 2026 |
| Gemini 3.5 Flash | Google DeepMind | 75.02% | gemini-3.5-flash-high | Official | Jan 8, 2026 |
| GPT-5.2 | OpenAI | 74.84% | gpt-5.2-2025-12-11-high | Official | Jan 8, 2026 |
| Qwen3.7 Max Preview | Alibaba / Qwen | 74.29% | qwen3.7-max | Official | Jan 8, 2026 |
| DeepSeek V4 Pro | DeepSeek | 73.58% | deepseek-v4-pro | Official | Jan 8, 2026 |
| Gemini 3 Pro | Google DeepMind | 73.39% | gemini-3-pro-preview-11-2025-high | Official | Jan 8, 2026 |
| Kimi K2.6 | Moonshot AI | 72.17% | kimi-k2.6-thinking | Official | Jan 8, 2026 |
| GLM-5.1 | Z.ai | 70.18% | glm-5.1 | Official | Jan 8, 2026 |
| Grok 4.20 beta reasoning | xAI | 67.96% | grok-4.20-beta-0309-reasoning | Official | Jan 8, 2026 |

Each row reports the model’s score on LiveBench.