What does score mean on LiveBench?

LiveBench reports a single metric, labeled score, where higher is better. Scores are shown only within LiveBench and are never averaged with other benchmarks.

What is the top reported LiveBench score?

GPT-5.5 has the top reported LiveBench score: 80.71%.

Why do LiveBench scores differ across runs?

Harness, scaffold, reasoning effort, and prompt setup change results, so two runs of the same model can differ. evals.report keeps each score with its run context so the differences stay visible.

Does evals.report rank models across benchmarks?

No. LiveBench scores are shown within their own metric; evals.report never combines benchmarks into a composite ranking or a single "best model".


LiveBench

A frequently updated public benchmark suite spanning reasoning, coding, math, language, data analysis, and instruction-following tasks.

Category: Reasoning. Metric: score (higher is better).


LiveBench is a contamination-free LLM benchmark with objective ground-truth scoring across six categories (reasoning, math, coding, language, data analysis, instruction following). The canonical harness (github.com/LiveBench/LiveBench) installs with pip (pip install -e .) and runs locally against any OpenAI-compatible or otherwise supported API model. run_livebench.py performs generation, ground-truth judgment, and result display in one pass, producing an aggregate score plus per-category and per-task CSVs.

When reporting a score, attach the exact harness commit, the --livebench-release-option date used, and the --bench-name subset, because the question set and metric differ by release.

Benchmark: LiveBench

Repository: github.com/LiveBench/LiveBench

Dataset: huggingface.co/datasets/livebench/model_judgment

Metric: score

1. Install

shell

git clone https://github.com/LiveBench/LiveBench.git

shell

python -m venv .venv

shell

source .venv/bin/activate

shell

cd LiveBench

shell

pip install -e .

shell

cd livebench/code_runner && pip install -r requirements_eval.txt && cd ..

shell

export OPENAI_API_KEY=<your-key>

2. Run evaluation

shell

python download_questions.py

shell

python run_livebench.py --model gpt-4o --bench-name live_bench/coding --livebench-release-option 2024-11-25

shell

python run_livebench.py --model gpt-4o --bench-name live_bench --mode parallel --parallel-requests 10 --livebench-release-option 2024-11-25
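When the same run is launched repeatedly with different scopes, it can help to assemble the command line programmatically so the release date and subset are never left out. The sketch below is one possible way to do that. build_run_command is a hypothetical helper, not part of the harness; it only uses flags that appear on this page, and it defaults the release to 2024-11-25 as an assumption based on the examples above.

```python
from typing import List, Optional


def build_run_command(
    model: str,
    bench_name: str = "live_bench",
    release: str = "2024-11-25",
    parallel_requests: Optional[int] = None,
    api_base: Optional[str] = None,
    api_key_name: Optional[str] = None,
) -> List[str]:
    """Assemble an argv list for run_livebench.py (hypothetical helper)."""
    cmd = [
        "python", "run_livebench.py",
        "--model", model,
        "--bench-name", bench_name,
        "--livebench-release-option", release,
    ]
    if parallel_requests is not None:
        # Parallel mode is requested only when a request count is given.
        cmd += ["--mode", "parallel", "--parallel-requests", str(parallel_requests)]
    if api_base is not None:
        # For non-OpenAI or self-hosted OpenAI-compatible endpoints.
        cmd += ["--api-base", api_base]
    if api_key_name is not None:
        # Name of the environment variable that holds the API key.
        cmd += ["--api-key-name", api_key_name]
    return cmd
```

The returned list can be passed to subprocess.run(cmd, check=True) from inside the livebench/ directory, and the same list can be stored alongside the score as a record of how the run was invoked.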

3. Score output

shell

python gen_ground_truth_judgment.py --bench-name live_bench/coding --question-source jsonl

shell

python show_livebench_result.py --bench-name live_bench --model-list gpt-4o

4. Expected output

run_livebench.py runs the full pipeline (generate answers -> ground-truth judgment -> show results) and prints the LiveBench leaderboard to the terminal. The category breakdown is written to all_groups.csv and the per-task breakdown to all_tasks.csv, both under the livebench working directory. The headline metric is the overall LiveBench score, which is the mean of the category scores. Only compare scores computed with the same --livebench-release-option and the same --bench-name subset; do not mix scores across release dates, and do not compare against the hosted leaderboard if it used a different release.
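The headline calculation and the comparison rule can be sketched as follows. This is a minimal illustration, not the harness's own code: overall_score and same_scope are hypothetical helpers, the dictionary keys "release" and "bench_name" are assumed names, and the category values are made-up placeholders rather than real results.

```python
from statistics import mean
from typing import Mapping


def overall_score(category_scores: Mapping[str, float]) -> float:
    """Unweighted mean of per-category scores."""
    if not category_scores:
        raise ValueError("need at least one category score")
    return mean(category_scores.values())


def same_scope(run_a: Mapping[str, str], run_b: Mapping[str, str]) -> bool:
    """True only if both runs used the same release date and subset."""
    return all(run_a[key] == run_b[key] for key in ("release", "bench_name"))


# Placeholder numbers for illustration only, not real LiveBench results.
example_categories = {"reasoning": 60.0, "coding": 50.0, "math": 40.0}
example_overall = overall_score(example_categories)

run_a = {"release": "2024-11-25", "bench_name": "live_bench"}
run_b = {"release": "2025-04-25", "bench_name": "live_bench"}
can_compare = same_scope(run_a, run_b)  # False: the release dates differ
```

Checking scope before comparing keeps a category-only run from being read against a full live_bench score by accident.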

5. Submit results

LiveBench publishes results on its public leaderboard (livebench.ai). Evaluating your own model locally requires no online submission, and the maintainers also offer to evaluate models on request via a GitHub issue or email. To self-report, run the harness and cite: the harness commit/version, the exact --model and any --api-base used, the --livebench-release-option date, the --bench-name scope (full live_bench vs a category), and the resulting overall score plus all_groups.csv/all_tasks.csv. Keep the full scaffold (decode settings, parallelism flags) attached to any reported number.
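One way to keep those citation fields attached to a number is to store them in a single record that is serialized next to the CSVs. The sketch below assumes nothing about the harness itself: RunRecord and its field names are hypothetical, and the values in the example are placeholders, not real results.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class RunRecord:
    """Context to cite with a self-reported score (hypothetical schema)."""
    harness_commit: str           # output of `git rev-parse HEAD` in the clone
    model: str                    # exact value passed to --model
    release: str                  # value passed to --livebench-release-option
    bench_name: str               # value passed to --bench-name
    overall_score: float          # headline score reported by the harness
    api_base: Optional[str] = None
    scaffold: Dict[str, object] = field(default_factory=dict)

    def to_json(self) -> str:
        """Serialize with sorted keys so records diff cleanly."""
        return json.dumps(asdict(self), sort_keys=True)


# Placeholder values for illustration only.
record = RunRecord(
    harness_commit="<commit-sha>",
    model="gpt-4o",
    release="2024-11-25",
    bench_name="live_bench",
    overall_score=0.0,
    scaffold={"mode": "parallel", "parallel_requests": 10},
)
serialized = record.to_json()
```

Making the record frozen is a small design choice: once a score is written down, its context should not be edited in place.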

Gotchas

The current release is 2025-04-25, but not all of its questions are public on Hugging Face. The README says to pass --livebench-release-option 2024-11-25 to every script in order to evaluate all categories on the most recent fully public question set.

Run all scripts from inside the livebench/ subdirectory (cd livebench after the pip install, per the README Usage section), not the repo root; download_questions.py, run_livebench.py, gen_api_answer.py, gen_ground_truth_judgment.py and show_livebench_result.py live there and write data to livebench/data/<category>/question.jsonl.

Coding tasks need the extra dependencies (pip install -r livebench/code_runner/requirements_eval.txt). Agentic coding questions require Docker (docker --version must work) with task-specific images that can consume up to ~150 GB. --mode parallel additionally requires tmux to be installed.

The API key defaults to the OPENAI_API_KEY env var. For non-OpenAI or self-hosted endpoints, set --api-base (OpenAI-compatible) and either override the key variable with --api-key-name or pass --api-key directly. Use --resume/--retry-failures (or scripts/rerun_failed_questions.py) to recover interrupted runs.
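Several of these gotchas can be caught before a long run starts. The following is a minimal pre-flight sketch, assuming only what is stated above: Docker is needed for agentic coding questions, tmux for --mode parallel, and an API key in an environment variable unless --api-key is passed directly. preflight is a hypothetical helper and is not part of the harness.

```python
import os
import shutil
from typing import Callable, List, Mapping, Optional


def preflight(
    parallel: bool,
    agentic_coding: bool,
    api_key_name: str = "OPENAI_API_KEY",
    env: Optional[Mapping[str, str]] = None,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return a list of problems found; an empty list means none detected."""
    env = os.environ if env is None else env
    problems: List[str] = []
    if agentic_coding and which("docker") is None:
        problems.append("docker not found on PATH (needed for agentic coding)")
    if parallel and which("tmux") is None:
        problems.append("tmux not found on PATH (needed for --mode parallel)")
    # Skip this check in your own script if you pass --api-key directly.
    if not env.get(api_key_name):
        problems.append(f"environment variable {api_key_name} is not set")
    return problems
```

Calling preflight(parallel=True, agentic_coding=False) checks the current machine; the env and which parameters exist so the function can be exercised without touching the real environment.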