
KernelBench evaluates whether an LLM can write fast, correct CUDA/GPU kernels that replace PyTorch operators. A run uses three official scripts: one generates samples from your model, one evaluates the generated kernels on a real GPU, and one computes the fast_p metric. Level 4 ('hard' / Hugging Face) has 20 tasks built from whole-model architectures and ships no baseline timings, so to score fast_1 you must first generate your own hardware-specific baseline times, which requires editing generate_baseline_time.py to include level 4 and to set the hardware name. Attach the following to any score: the level (4), the GPU/hardware identifier, the inference model_name and server_type, the temperature, the number of samples per task, the KernelBench commit, and the baseline file used.

Benchmark: KernelBench (Hard)
Dataset: huggingface.co/datasets/ScalingIntelligence/KernelBench
Metric: fast₁

1. Install

shell
git clone https://github.com/ScalingIntelligence/KernelBench.git
shell
cd KernelBench
shell
uv sync
shell
uv sync --extra gpu
shell
cp .env.example .env   # then add your inference API key(s); KernelBench uses litellm

2. Run evaluation

shell
# 1) Generate kernels from your model for Level 4 (hard / HuggingFace). Swap server_type/model_name for your provider.
uv run python scripts/generate_samples.py run_name=my_level4_run dataset_src=huggingface level=4 num_workers=50 server_type=deepseek model_name=deepseek-chat temperature=0
shell
# 2) Level 4 ships no baselines, so generate hardware-specific PyTorch baseline times first.
# scripts/generate_baseline_time.py takes no CLI args: its __main__ block hardcodes 'for level in [1, 2, 3]' and the hardware_name/precision.
# Edit the source to include level 4 and set hardware_name to the value you will pass as 'hardware=' to benchmark_eval_analysis.py, then run it (CUDA required):
uv run python scripts/generate_baseline_time.py
shell
# 3) Evaluate the generated kernels on real GPU(s). Set num_gpu_devices to your GPU count.
uv run python scripts/eval_from_generations.py run_name=my_level4_run dataset_src=local level=4 num_gpu_devices=8 timeout=300

3. Score output

shell
# Compute fast_p (includes fast_0/fast_1/fast_2). hardware + baseline must exactly match the baseline file you generated (README example: hardware=L40S_matx3 baseline=baseline_time_torch).
uv run python scripts/benchmark_eval_analysis.py run_name=my_level4_run level=4 hardware=<your_gpu_id> baseline=<your_baseline_name>

4. Expected output

generate_samples.py writes the generated kernel code for each task under runs/<run_name>/. eval_from_generations.py compiles and runs each kernel on the GPU and records per-task correctness and timing in an eval results JSON. benchmark_eval_analysis.py computes fast_p for p in [0.0, 0.5, 0.8, 1.0, 1.5, 2.0] from is_correct, baseline_speed, and actual_speed. fast_1 (p=1.0) is the fraction of tasks that are correct and faster than the PyTorch baseline; fast_0 is correctness only. Report Level 4 fast_1 only from Level 4 results scored against a matching, self-generated Level 4 baseline file; do not mix levels or hardware.
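To make the metric concrete, here is a minimal sketch of a fast_p calculation. It is not the code in benchmark_eval_analysis.py: the function name, argument names, and the toy timings are illustrative, and it assumes a strict comparison (speedup > p), under which fast_0 reduces to plain correctness.

```python
def fast_p(is_correct, baseline_time, actual_time, p):
    """Fraction of ALL tasks that are correct and whose speedup exceeds p.

    Illustrative sketch only; assumes speedup = baseline_time / actual_time
    and a strict '>' threshold.
    """
    n = len(is_correct)
    if n == 0:
        return 0.0
    hits = 0
    for ok, base, actual in zip(is_correct, baseline_time, actual_time):
        if not ok:
            continue  # incorrect kernels never count, however fast they are
        if base / actual > p:
            hits += 1
    return hits / n  # denominator is every task, not only the correct ones


# Toy data (made up): four tasks, the third kernel is incorrect.
is_correct = [True, True, False, True]
baseline_time = [10.0, 10.0, 10.0, 10.0]
actual_time = [5.0, 20.0, 1.0, 8.0]  # speedups: 2.0, 0.5, (ignored), 1.25

for p in (0.0, 1.0, 2.0):
    print(f"fast_{p:g} = {fast_p(is_correct, baseline_time, actual_time, p)}")
```

On this toy data, three of four kernels are correct, but only two of them beat the baseline, so fast_1 is lower than fast_0.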

5. Submit results

There is no central leaderboard submission flow; KernelBench is run locally, and each result is reported together with its run context. Attach the following to any reported fast_1: level=4, the exact GPU/hardware string and driver/CUDA stack, the baseline file name (Level 4 baselines are self-generated and hardware-specific), the inference server_type, model_name, and temperature, the number of samples per task (pass@k context), num_gpu_devices and timeout, and the KernelBench git commit. Compare scores only across identical hardware, baseline, and level.
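One way to keep this context attached to a score is a small record saved next to the results. The sketch below is only an illustration: the ScoreRecord class, its field names, and the comparable helper are hypothetical and not part of KernelBench, and the values are placeholders or taken from the example commands above.

```python
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class ScoreRecord:
    """Hypothetical record of one reported fast_1 and its run context."""
    fast_1: float
    level: int
    hardware: str
    cuda_stack: str
    baseline: str
    server_type: str
    model_name: str
    temperature: float
    samples_per_task: int
    num_gpu_devices: int
    timeout_s: int
    kernelbench_commit: str


def comparable(a: ScoreRecord, b: ScoreRecord) -> bool:
    """Scores are comparable only with identical level, hardware, and baseline."""
    return (a.level, a.hardware, a.baseline) == (b.level, b.hardware, b.baseline)


record = ScoreRecord(
    fast_1=0.0,                      # placeholder, not a measured result
    level=4,
    hardware="<your_gpu_id>",        # must match the 'hardware=' you scored with
    cuda_stack="<driver/CUDA versions>",
    baseline="<your_baseline_name>",
    server_type="deepseek",
    model_name="deepseek-chat",
    temperature=0.0,
    samples_per_task=1,              # placeholder; record what you actually ran
    num_gpu_devices=8,
    timeout_s=300,
    kernelbench_commit="<commit-sha>",
)

other = replace(record, hardware="<another_gpu_id>")
print(json.dumps(asdict(record), indent=2))
print(comparable(record, other))  # False: different hardware
```

Freezing the dataclass keeps a record from being edited after the run, and comparable refuses to line up scores that differ in level, hardware, or baseline.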

Gotchas

A real CUDA GPU is required for both eval_from_generations.py (kernels are compiled and timed) and generate_baseline_time.py (it uses torch.device('cuda:0')). Without local GPUs, use the Modal variants (generate_and_eval_single_sample_modal.py / generate_baseline_time_modal.py).
Level 4 ships no baseline timings, and scripts/generate_baseline_time.py does not generate them out of the box: its loop is hardcoded to 'for level in [1, 2, 3]' and its hardware_name/precision are hardcoded in __main__ (no CLI args). You must edit the source to add level 4 and set the hardware name, then run it on your own hardware. Without that baseline file, benchmark_eval_analysis.py cannot compute fast_1 or any other speedup-based threshold.
fast_1 counts kernels that are both correct and faster than the baseline; it is not plain accuracy. fast_0 is the correctness-only fraction. Don't report fast_0 as fast_1, and keep p, hardware, and baseline consistent. benchmark_eval_analysis.py also emits geometric_mean_speed_ratio_correct_only; do not confuse it with fast_p.
eval_from_generations.py uses dataset_src=local (reads the kernels you just generated), while generate_samples.py uses dataset_src=huggingface (pulls the level_4 split, 20 rows, from the HF dataset). Swapping the two values is a common error. Also set num_gpu_devices to your actual GPU count and choose a sufficient timeout.
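The difference between fast_p and a geometric-mean speed ratio over correct kernels can be seen on toy numbers. The sketch below is illustrative only: the function name and data are made up, the geometric mean is computed the textbook way (not necessarily how benchmark_eval_analysis.py computes its geometric_mean_speed_ratio_correct_only field), and fast_1 assumes a strict speedup > 1 threshold.

```python
import math


def geomean_speedup_correct_only(is_correct, baseline_time, actual_time):
    """Geometric mean of baseline/actual over correct kernels only (sketch)."""
    ratios = [b / a for ok, b, a in zip(is_correct, baseline_time, actual_time) if ok]
    if not ratios:
        return 0.0
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Toy data (made up): one large win, one slowdown, one incorrect kernel.
is_correct = [True, True, False]
baseline_time = [8.0, 2.0, 1.0]
actual_time = [1.0, 4.0, 1.0]  # speedups: 8.0, 0.5, (ignored)

geomean = geomean_speedup_correct_only(is_correct, baseline_time, actual_time)

# fast_1: correct AND faster than baseline, divided by ALL tasks.
fast_1 = sum(
    1 for ok, b, a in zip(is_correct, baseline_time, actual_time) if ok and b / a > 1.0
) / len(is_correct)

print(f"geometric mean speedup (correct only): {geomean:.3f}")
print(f"fast_1: {fast_1:.3f}")
```

Here a single large win pulls the geometric mean to about 2x, while only one task in three is both correct and faster than the baseline, so the two numbers answer different questions.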