Run KernelBench (Hard)
The same run guide is also available from the benchmark detail page.
KernelBench evaluates whether an LLM can write fast, correct CUDA/GPU kernels that replace PyTorch operators. You run three official scripts: generate samples from your model, evaluate the generated kernels on a real GPU, then compute the fast_p metric. Level 4 ('hard' / Hugging Face) has 20 tasks of whole-model architectures and ships NO baseline timings, so to score fast_1 you must first generate your own hardware-specific baseline times (which requires editing generate_baseline_time.py to include level 4 and to set the hardware name). Keep attached to any score: the level (=4), GPU/hardware identifier, the inference model_name/server_type, temperature, num samples per task, KernelBench commit, and the baseline file used.
1Install
git clone https://github.com/ScalingIntelligence/KernelBench.gitcd KernelBenchuv syncuv sync --extra gpucp .env.example .env # then add your inference API key(s); KernelBench uses litellm2Run evaluation
# 1) Generate kernels from your model for Level 4 (hard / HuggingFace). Swap server_type/model_name for your provider.
uv run python scripts/generate_samples.py run_name=my_level4_run dataset_src=huggingface level=4 num_workers=50 server_type=deepseek model_name=deepseek-chat temperature=0# 2) Level 4 ships NO provided baselines: generate hardware-specific PyTorch baseline times first. NOTE: scripts/generate_baseline_time.py takes NO CLI args and hardcodes 'for level in [1, 2, 3]' plus a hardcoded hardware_name/precision in its __main__ block. To produce Level 4 baselines you MUST edit the source to include level 4 and set hardware_name to match the 'hardware=' you will pass in step 4. Then run it (CUDA required):
uv run python scripts/generate_baseline_time.py# 3) Evaluate the generated kernels on real GPU(s). Set num_gpu_devices to your GPU count.
uv run python scripts/eval_from_generations.py run_name=my_level4_run dataset_src=local level=4 num_gpu_devices=8 timeout=3003Score output
# Compute fast_p (includes fast_0/fast_1/fast_2). hardware + baseline must exactly match the baseline file you generated (README example: hardware=L40S_matx3 baseline=baseline_time_torch).
uv run python scripts/benchmark_eval_analysis.py run_name=my_level4_run level=4 hardware=<your_gpu_id> baseline=<your_baseline_name>4Expected output
generate_samples.py writes generated kernel code per task under runs/<run_name>/. eval_from_generations.py compiles and runs each kernel on the GPU, producing per-task correctness and timing (eval results JSON). benchmark_eval_analysis.py computes fast_p over p in [0.0, 0.5, 0.8, 1.0, 1.5, 2.0] from is_correct, baseline_speed, and actual_speed; fast_1 (p=1.0) is the fraction of tasks that are correct AND at least as fast as the PyTorch baseline (fast_0 is correctness-only). Report fast_1 for Level 4 only against Level-4 results and a matching, self-generated Level-4 baseline file; do not mix levels or hardware.
5Submit results
There is no central leaderboard submission flow; KernelBench is run locally and reported in-context. Attach to any reported fast_1: level=4, the exact GPU/hardware string and driver/CUDA stack, baseline file name (Level 4 baselines are self-generated and hardware-specific), inference server_type + model_name + temperature, number of samples per task (pass@k context), num_gpu_devices/timeout, and the KernelBench git commit. Compare scores only across identical hardware + baseline + level.