KernelBench: Can LLMs Write Efficient GPU Kernels?
- URL: http://arxiv.org/abs/2502.10517v1
- Date: Fri, 14 Feb 2025 19:30:53 GMT
- Title: KernelBench: Can LLMs Write Efficient GPU Kernels?
- Authors: Anne Ouyang, Simon Guo, Simran Arora, Alex L. Zhang, William Hu, Christopher Ré, Azalia Mirhoseini
- Abstract summary: KernelBench is an open-source framework for evaluating language models' ability to write fast and correct kernels.
We introduce a new evaluation metric fast_p, which measures the percentage of generated kernels that are functionally correct and offer a speedup greater than an adjustable threshold p over baseline.
Our experiments show that frontier reasoning models perform the best out of the box but still fall short overall.
- Score: 36.4117525096377
- Abstract: Efficient GPU kernels are crucial for building performant machine learning architectures, but writing them is a time-consuming challenge that requires significant expertise; therefore, we explore using language models (LMs) to automate kernel generation. We introduce KernelBench, an open-source framework for evaluating LMs' ability to write fast and correct kernels on a suite of 250 carefully selected PyTorch ML workloads. KernelBench represents a real-world engineering environment, and making progress on the introduced benchmark directly translates to faster practical kernels. We introduce a new evaluation metric fast_p, which measures the percentage of generated kernels that are functionally correct and offer a speedup greater than an adjustable threshold p over baseline. Our experiments across various state-of-the-art models and test-time methods show that frontier reasoning models perform the best out of the box but still fall short overall, matching the PyTorch baseline in less than 20% of cases. While we show that results can improve by leveraging execution and profiling feedback during iterative refinement, KernelBench remains a challenging benchmark, with its difficulty increasing as we raise the speedup threshold p.
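To make the fast_p definition concrete, here is a minimal sketch of how the metric can be computed from per-kernel results (the result fields and harness are assumptions for illustration, not KernelBench's actual API):

```python
from dataclasses import dataclass

@dataclass
class KernelResult:
    correct: bool    # passed the functional correctness checks
    speedup: float   # baseline_time / generated_kernel_time

def fast_p(results: list[KernelResult], p: float) -> float:
    """Fraction of generated kernels that are functionally correct AND
    achieve a speedup greater than the threshold p over the baseline."""
    if not results:
        return 0.0
    hits = sum(1 for r in results if r.correct and r.speedup > p)
    return hits / len(results)

# fast_1 asks: is the kernel correct and faster than the baseline at all?
results = [KernelResult(True, 1.4), KernelResult(True, 0.8), KernelResult(False, 2.1)]
print(fast_p(results, p=1.0))  # 0.333... -- only one kernel clears the bar
```

Raising p tightens the bar: fast_0 reduces to plain correctness, while fast_2 only credits kernels at least twice as fast as the baseline.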
Related papers
- A User's Guide to $\texttt{KSig}$: GPU-Accelerated Computation of the Signature Kernel [12.111848705677138]
The signature kernel is a positive definite kernel for sequential and temporal data.
In this chapter, we give a short introduction to $\texttt{KSig}$, a $\texttt{Scikit-Learn}$ compatible Python package that implements various GPU-accelerated algorithms for computing signature kernels.
arXiv Detail & Related papers (2025-01-13T09:11:13Z)
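Since the entry's selling point is Scikit-Learn compatibility, the following sketch shows the usual precomputed-kernel plumbing; the ksig calls are left commented out because their exact API is an assumption here, while the Scikit-Learn side is standard:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical KSig usage (names are assumptions, shown only for shape):
# import ksig
# sig_kernel = ksig.kernels.SignatureKernel(n_levels=4)
# K_train = sig_kernel(X_train)           # (n, n) Gram matrix, computed on GPU
# K_test = sig_kernel(X_test, X_train)    # (m, n) cross Gram matrix

# Any estimator that accepts precomputed kernels can consume the Gram matrix:
rng = np.random.default_rng(0)
G = rng.normal(size=(40, 40))
K_train = G @ G.T                         # stand-in positive semi-definite Gram
y_train = rng.integers(0, 2, size=40)
clf = SVC(kernel="precomputed").fit(K_train, y_train)
print(clf.predict(K_train[:5]))           # prediction also takes kernel rows
```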
- ThunderKittens: Simple, Fast, and Adorable AI Kernels [43.32681787348603]
We present ThunderKittens (TK), a framework for writing performant AI kernels while remaining easy to use and maintain.
We show the value of TK by providing kernels that match or outperform prior kernels for a range of AI operations.
arXiv Detail & Related papers (2024-10-27T10:07:16Z)
- Liger Kernel: Efficient Triton Kernels for LLM Training [6.373771349397682]
Training Large Language Models (LLMs) efficiently at scale presents a formidable challenge, driven by their ever-increasing computational demands.
We introduce Liger-Kernel, an open-sourced set of Triton kernels developed specifically for LLM training.
With kernel optimization techniques like kernel operation fusing and input chunking, our kernels achieve on average a 20% increase in training throughput and a 60% reduction in GPU memory usage.
arXiv Detail & Related papers (2024-10-14T18:17:01Z)
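The input-chunking idea from the Liger-Kernel entry is easy to see in plain PyTorch; this is a generic sketch of a chunked loss over a vocabulary projection, not Liger-Kernel's actual Triton code (fused kernels additionally avoid retaining per-chunk logits for the backward pass):

```python
import torch
import torch.nn.functional as F

def chunked_linear_cross_entropy(hidden, weight, targets, chunk_size=1024):
    """Cross-entropy over a vocab projection without materializing the full
    (seq_len, vocab_size) logits tensor in one piece."""
    total = hidden.new_zeros(())
    for start in range(0, hidden.shape[0], chunk_size):
        h = hidden[start:start + chunk_size]            # (chunk, hidden_dim)
        logits = h @ weight.T                           # (chunk, vocab_size)
        total = total + F.cross_entropy(
            logits, targets[start:start + chunk_size], reduction="sum")
    return total / hidden.shape[0]

hidden = torch.randn(4096, 512)
weight = torch.randn(32000, 512)                        # vocab projection weight
targets = torch.randint(0, 32000, (4096,))
loss = chunked_linear_cross_entropy(hidden, weight, targets)
```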
- KGym: A Platform and Dataset to Benchmark Large Language Models on Linux Kernel Crash Resolution [59.20933707301566]
Large Language Models (LLMs) are consistently improving at increasingly realistic software engineering (SE) tasks.
In real-world software stacks, significant SE effort is spent developing foundational system software like the Linux kernel.
To evaluate if ML models are useful while developing such large-scale systems-level software, we introduce kGym and kBench.
arXiv Detail & Related papers (2024-07-02T21:44:22Z)
- Optimal Kernel Tuning Parameter Prediction using Deep Sequence Models [0.44998333629984877]
We propose a methodology that uses deep sequence-to-sequence models to predict the optimal tuning parameters governing compute kernels.
The proposed algorithm can achieve more than 90% accuracy on various convolutional kernels in MIOpen, the AMD machine learning primitives library.
arXiv Detail & Related papers (2024-04-15T22:25:54Z)
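The summary above names the approach but not the architecture; as a toy illustration of casting tuning-parameter prediction as sequence-to-sequence decoding (model shape, vocabularies, and sizes are all assumptions):

```python
import torch
import torch.nn as nn

class TuningSeq2Seq(nn.Module):
    """Toy encoder-decoder: kernel-description tokens in, tuning-parameter
    tokens out. Purely illustrative of the seq2seq framing."""
    def __init__(self, vocab_in=256, vocab_out=64, d_model=128):
        super().__init__()
        self.src_emb = nn.Embedding(vocab_in, d_model)
        self.tgt_emb = nn.Embedding(vocab_out, d_model)
        self.encoder = nn.LSTM(d_model, d_model, batch_first=True)
        self.decoder = nn.LSTM(d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, vocab_out)

    def forward(self, src_tokens, tgt_tokens):
        _, state = self.encoder(self.src_emb(src_tokens))
        out, _ = self.decoder(self.tgt_emb(tgt_tokens), state)
        return self.head(out)   # logits over tuning-parameter tokens

model = TuningSeq2Seq()
src = torch.randint(0, 256, (2, 16))   # e.g. encoded conv shapes and dtypes
tgt = torch.randint(0, 64, (2, 5))     # e.g. tile sizes and unroll factors
logits = model(src, tgt)               # (2, 5, 64)
```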
- Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
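The two-step recipe in the entry above (a small computational core plus declaratively specified loops) can be mimicked in a few lines of Python; the names and the NumPy stand-in are mine, not the framework's actual interface:

```python
import numpy as np
from itertools import product

# Step 1: the computational core as a small tile primitive.
TI = TJ = 32
def matmul_tile(A, B, C, i, j):
    """Accumulate one (TI, TJ) tile of C = A @ B."""
    C[i:i+TI, j:j+TJ] += A[i:i+TI, :] @ B[:, j:j+TJ]

# Step 2: the logical loops declared as data rather than hand-written nests,
# so blocking and loop order can change without touching the core.
M = N = K = 128
A, B, C = np.random.rand(M, K), np.random.rand(K, N), np.zeros((M, N))
loop_nest = [range(0, M, TI), range(0, N, TJ)]

for i, j in product(*loop_nest):
    matmul_tile(A, B, C, i, j)

assert np.allclose(C, A @ B)
```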
- Snacks: a fast large-scale kernel SVM solver [0.8602553195689513]
Snacks is a new large-scale solver for Kernel Support Vector Machines.
Snacks relies on a Nyström approximation of the kernel matrix and an accelerated variant of the subgradient method.
arXiv Detail & Related papers (2023-04-17T04:19:20Z)
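The Nyström step mentioned in the Snacks entry has a compact textbook construction, sketched generically below (this is not the solver's actual code):

```python
import numpy as np

def nystrom_features(X, landmarks, gamma=1.0):
    """Finite-dimensional features whose inner products approximate an RBF
    kernel matrix: Phi @ Phi.T ~ K, with cost driven by the number of
    landmark points rather than the full dataset size."""
    def rbf(A, B):
        sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq_dists)
    K_nm = rbf(X, landmarks)             # (n, m) data-to-landmark kernel
    K_mm = rbf(landmarks, landmarks)     # (m, m) landmark kernel
    w, V = np.linalg.eigh(K_mm)          # symmetric inverse square root
    w = np.clip(w, 1e-12, None)
    K_mm_inv_sqrt = (V / np.sqrt(w)) @ V.T
    return K_nm @ K_mm_inv_sqrt

X = np.random.default_rng(0).normal(size=(500, 8))
Phi = nystrom_features(X, X[:50])        # 50 landmarks -> (500, 50) features
```

A linear solver (here, the accelerated subgradient variant the entry mentions) then runs on Phi instead of the full kernel matrix.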
- Kernel Identification Through Transformers [54.3795894579111]
Kernel selection plays a central role in determining the performance of Gaussian Process (GP) models.
This work addresses the challenge of constructing custom kernel functions for high-dimensional GP regression models.
We introduce a novel approach named KITT: Kernel Identification Through Transformers.
arXiv Detail & Related papers (2021-06-15T14:32:38Z)
- Kernel methods through the roof: handling billions of points efficiently [94.31450736250918]
Kernel methods provide an elegant and principled approach to nonparametric learning, but so far they could hardly be used in large-scale problems.
Recent advances have shown the benefits of a number of algorithmic ideas, for example combining optimization, numerical linear algebra and random projections.
Here, we push these efforts further to develop and test a solver that takes full advantage of GPU hardware.
arXiv Detail & Related papers (2020-06-18T08:16:25Z)
- PolyScientist: Automatic Loop Transformations Combined with Microkernels for Optimization of Deep Learning Primitives [55.79741270235602]
We develop a hybrid solution to the development of deep learning kernels.
We use advanced polyhedral technology to automatically tune the outer loops for performance.
arXiv Detail & Related papers (2020-02-06T08:02:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.