evals.report

KernelBench (Hard)

Stanford KernelBench's hardest tier: generate correct, high-performance GPU (CUDA) kernels from PyTorch reference operators, scored by fast₁, the fraction of tasks whose generated kernel is both correct and faster than the PyTorch reference baseline.

Coding · fast₁ · Higher is better

What this benchmark measures

Rows on this page are sourced from public benchmark artifacts, leaderboard exports, or source-linked model reports. Each row keeps the benchmark version, the source's model name, and any available run details attached to its score.

The metric shown here is fast₁. It should be interpreted within KernelBench (Hard), not compared as part of a site-wide ranking.
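For reference, fast₁ is the p = 1 member of the fast_p family of metrics. The block below is a sketch of that definition, assuming N tasks in the tier, correct_i meaning that task i's generated kernel matches the PyTorch reference within the correctness tolerance, and t^base_i and t^gen_i being the measured runtimes of the reference and of the generated kernel on the same GPU:

```latex
\mathrm{fast}_p = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\!\left[\mathrm{correct}_i \;\wedge\; \frac{t^{\mathrm{base}}_i}{t^{\mathrm{gen}}_i} > p\right],
\qquad \mathrm{fast}_1 = \mathrm{fast}_p \big|_{p=1}
```

With p = 0 the speedup condition always holds for positive runtimes, so fast_0 is simply the correctness rate; fast₁ additionally requires a strict speedup over the baseline.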

What to be careful about

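fast₁ depends heavily on the GPU the kernels are timed on and on the numerical tolerance used to judge correctness. MiniMax, for example, reported its Hard-tier run on NVIDIA Blackwell GPUs, so its score is not directly comparable to runs on other hardware or with different tolerances.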
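To make the tolerance sensitivity concrete, here is a minimal sketch that scores the same set of kernel outputs under a loose and a tight tolerance. It assumes a hypothetical `TaskResult` record (not KernelBench's actual schema), an allclose-style correctness rule (|gen - ref| <= atol + rtol * |ref|, as used by `numpy.allclose`), and made-up runtimes; the default tolerances are illustrative assumptions, not KernelBench's documented settings.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    # Hypothetical per-task record; field names are illustrative only.
    reference: list[float]  # outputs of the PyTorch reference operator
    generated: list[float]  # outputs of the generated CUDA kernel
    t_base_ms: float        # measured runtime of the reference
    t_gen_ms: float         # measured runtime of the generated kernel


def is_correct(ref: list[float], gen: list[float], atol: float, rtol: float) -> bool:
    # Elementwise |gen - ref| <= atol + rtol * |ref|, the rule numpy.allclose applies.
    return len(ref) == len(gen) and all(
        abs(g - r) <= atol + rtol * abs(r) for r, g in zip(ref, gen)
    )


def fast_p(results: list[TaskResult], p: float = 1.0,
           atol: float = 1e-2, rtol: float = 1e-2) -> float:
    # Fraction of tasks that are correct AND have speedup t_base / t_gen strictly above p.
    if not results:
        raise ValueError("fast_p is undefined for an empty result set")
    hits = sum(
        1
        for r in results
        if is_correct(r.reference, r.generated, atol, rtol)
        and r.t_base_ms / r.t_gen_ms > p
    )
    return hits / len(results)


# Made-up results: timings would differ on another GPU, changing the speedups.
results = [
    TaskResult([1.0, 2.0], [1.0, 2.0], t_base_ms=2.0, t_gen_ms=1.0),      # exact, 2x faster
    TaskResult([1.0, 2.0], [1.005, 2.01], t_base_ms=3.0, t_gen_ms=2.0),   # small error, 1.5x faster
    TaskResult([1.0], [1.0], t_base_ms=1.0, t_gen_ms=1.25),               # exact, but slower
    TaskResult([1.0], [1.5], t_base_ms=4.0, t_gen_ms=1.0),                # fast, but wrong
]

print(fast_p(results, atol=1e-2, rtol=1e-2))  # 0.5
print(fast_p(results, atol=1e-4, rtol=1e-4))  # 0.25
```

The second task passes only under the loose tolerance, so the same kernels score 0.5 or 0.25 depending on that one setting. Re-timing them on different hardware would move the speedup ratios in the same way.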

No composite ranking
evals.report never combines benchmarks. fast₁ on KernelBench (Hard) is a standalone number; don’t average it with other metrics.