Question 1

What is LiveBench?

Accepted Answer

LiveBench is a frequently updated public benchmark suite spanning reasoning, coding, math, language, and instruction-following tasks. It is categorized as a reasoning benchmark, and its metric is a percentage score.

Question 2

What does score mean on LiveBench?

Accepted Answer

LiveBench reports a single metric, a percentage score; higher is better. Scores are shown only within LiveBench and are never averaged with other benchmarks.

Question 3

What is the top reported LiveBench score?

Accepted Answer

GPT-5.5 (OpenAI) has the top reported score on LiveBench: 80.71%, reported for source model gpt-5.5-xhigh.

Question 4

Why do LiveBench scores differ across runs?

Accepted Answer

Results depend on the harness, scaffold, reasoning effort, and prompt setup, so two runs of the same model can differ. evals.report keeps each score with its run context so the differences stay visible.
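One way to picture keeping a score together with its run context is a record that stores the configuration alongside the number. The sketch below is illustrative only: the `RunResult` class, its field names, and the `runs_for` helper are hypothetical and are not evals.report's actual schema. The two sample entries use values from the LiveBench leaderboard on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """One reported score plus the context it was produced in (hypothetical schema)."""
    model: str
    benchmark: str
    score_pct: float
    source_model: str  # configuration identifier, e.g. includes the effort level
    status: str
    date: str


# Sample values taken from the LiveBench leaderboard on this page.
RESULTS = [
    RunResult("GPT-5.5", "LiveBench", 80.71, "gpt-5.5-xhigh", "Official", "Apr 23, 2026"),
    RunResult("GPT-5.4", "LiveBench", 80.28, "gpt-5.4-xhigh", "Official", "Mar 5, 2026"),
]


def runs_for(model: str, results: list[RunResult]) -> list[RunResult]:
    """Return every run recorded for a model, without averaging or collapsing them."""
    return [r for r in results if r.model == model]
```

Returning a list rather than a single number means that if a second run of the same model were recorded under a different configuration, both entries would remain visible side by side.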

Question 5

Does evals.report rank models across benchmarks?

Accepted Answer

No. LiveBench scores are shown within their own metric; evals.report never combines benchmarks into a composite ranking or a single "best model".

Model	Lab	Score	Source model	Status	Date
GPT-5.5	OpenAI	80.71%	gpt-5.5-xhigh	Official	Apr 23, 2026
GPT-5.4	OpenAI	80.28%	gpt-5.4-xhigh	Official	Mar 5, 2026
Gemini 3.1 Pro Preview	Google DeepMind	79.93%	gemini-3.1-pro-preview-high	Official	Feb 19, 2026
Claude Opus 4.8	Anthropic	77.22%	claude-opus-4-8-xhigh-effort	Official	May 28, 2026
Claude Opus 4.7	Anthropic	76.91%	claude-opus-4-7-xhigh-effort	Official	Apr 16, 2026
Claude Opus 4.6	Anthropic	76.33%	claude-opus-4-6-thinking-auto-high-effort	Official	Feb 5, 2026
Claude Opus 4.5	Anthropic	75.96%	claude-opus-4-5-20251101-thinking-64k-high-effort	Official	Nov 24, 2025
Claude Sonnet 4.6	Anthropic	75.47%	claude-sonnet-4-6-thinking-auto-medium-effort	Official	Feb 17, 2026
Gemini 3.5 Flash	Google DeepMind	75.02%	gemini-3.5-flash-high	Official	May 19, 2026
GPT-5.2	OpenAI	74.84%	gpt-5.2-2025-12-11-high	Official	Dec 11, 2025
Qwen3.7 Max Preview	Alibaba / Qwen	74.29%	qwen3.7-max	Official	May 14, 2026
DeepSeek V4 Pro (Open)	DeepSeek	73.58%	deepseek-v4-pro	Official	Apr 24, 2026
Gemini 3 Pro	Google DeepMind	73.39%	gemini-3-pro-preview-11-2025-high	Official	Nov 18, 2025
Kimi K2.6 (Open)	Moonshot AI	72.17%	kimi-k2.6-thinking	Official	Apr 20, 2026
GLM-5.1 (Open)	Z.ai	70.18%	glm-5.1	Official	Apr 7, 2026
Grok 4.20 beta reasoning	xAI	67.96%	grok-4.20-beta-0309-reasoning	Official	Apr 9, 2026
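
The leaderboard can be read programmatically as tab-separated text. The following is a minimal sketch, assuming the rows are available as a string with a header column named `Score`; the `parse_leaderboard` and `top_entry` helpers are hypothetical names introduced here for illustration, and only three rows from the table are included to keep the example short.

```python
import csv
import io

# Three rows taken from the leaderboard above (tab-separated).
ROWS = (
    "Model\tLab\tScore\tSource model\tStatus\tDate\n"
    "GPT-5.4\tOpenAI\t80.28%\tgpt-5.4-xhigh\tOfficial\tMar 5, 2026\n"
    "GPT-5.5\tOpenAI\t80.71%\tgpt-5.5-xhigh\tOfficial\tApr 23, 2026\n"
    "Claude Opus 4.8\tAnthropic\t77.22%\tclaude-opus-4-8-xhigh-effort\tOfficial\tMay 28, 2026\n"
)


def parse_leaderboard(text):
    """Parse tab-separated rows into dicts, converting 'Score' from '80.71%' to 80.71."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = []
    for row in reader:
        row["Score"] = float(row["Score"].rstrip("%"))
        rows.append(row)
    return rows


def top_entry(rows):
    """Return the row with the highest score (higher is better on LiveBench)."""
    return max(rows, key=lambda r: r["Score"])


if __name__ == "__main__":
    best = top_entry(parse_leaderboard(ROWS))
    print(best["Model"], best["Score"], best["Source model"])
```

The source model stays attached to each parsed row, so the score is never separated from the configuration it was reported for.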