ThunderKittens: Simple, Fast, and Adorable AI Kernels
- URL: http://arxiv.org/abs/2410.20399v1
- Date: Sun, 27 Oct 2024 10:07:16 GMT
- Title: ThunderKittens: Simple, Fast, and Adorable AI Kernels
- Authors: Benjamin F. Spector, Simran Arora, Aaryan Singhal, Daniel Y. Fu, Christopher Ré
- Abstract summary: We present ThunderKittens (TK), a framework for writing performant AI kernels while remaining easy to use and maintain.
We show the value of TK by providing kernels that match or outperform prior kernels for a range of AI operations.
- Score: 43.32681787348603
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: The challenge of mapping AI architectures to GPU hardware is creating a critical bottleneck in AI progress. Despite substantial efforts, hand-written custom kernels fail to meet their theoretical performance thresholds, even on well-established operations like linear attention. The diverse hardware capabilities of GPUs might suggest that we need a wide variety of techniques to achieve high performance. However, our work explores whether a small number of key abstractions can drastically simplify the process. We present ThunderKittens (TK), a framework for writing performant AI kernels while remaining easy to use and maintain. Our abstractions map to the three levels of the GPU hierarchy: (1) at the warp-level, we provide 16x16 matrix tiles as basic data structures and PyTorch-like parallel compute operations over tiles, (2) at the thread-block level, we provide a template for overlapping asynchronous operations across parallel warps, and (3) at the grid-level, we provide support to help hide the block launch and tear-down, and memory costs. We show the value of TK by providing kernels that match or outperform prior kernels for a range of AI operations. We match CuBLAS and FlashAttention-3 on GEMM and attention inference performance and outperform the strongest baselines by $10-40\%$ on attention backwards, $8\times$ on state space models, and $14\times$ on linear attention.
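To make the warp-level abstraction concrete, the sketch below expresses a matrix multiply as operations over 16x16 tiles, in the spirit of point (1) above. It is a minimal NumPy illustration, not the ThunderKittens API (which is CUDA C++); names such as tile_matmul are purely illustrative.

```python
# Conceptual NumPy illustration of the warp-level "16x16 tile" abstraction
# described above. This is NOT the ThunderKittens API (which is CUDA C++);
# names like `tile_matmul` are illustrative only.
import numpy as np

TILE = 16  # TK's basic tile granularity at the warp level

def tile_matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Matrix multiply expressed as PyTorch-like ops over 16x16 tiles."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % TILE == 0 and K % TILE == 0 and N % TILE == 0
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, TILE):              # in TK, warps own output tiles
        for j in range(0, N, TILE):
            acc = np.zeros((TILE, TILE), dtype=A.dtype)  # register-tile accumulator
            for k in range(0, K, TILE):
                a = A[i:i+TILE, k:k+TILE]    # load a 16x16 input tile
                b = B[k:k+TILE, j:j+TILE]
                acc += a @ b                 # tile-level matrix-multiply-accumulate
            C[i:i+TILE, j:j+TILE] = acc      # store the output tile
    return C

if __name__ == "__main__":
    A = np.random.randn(64, 32).astype(np.float32)
    B = np.random.randn(32, 48).astype(np.float32)
    assert np.allclose(tile_matmul(A, B), A @ B, atol=1e-4)
```

In the real framework, each tile operation maps onto tensor-core instructions, and the thread-block template overlaps the tile loads with compute asynchronously, which this sequential sketch does not capture.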
Related papers
- Challenging GPU Dominance: When CPUs Outperform for On-Device LLM Inference [6.829272097221596]
We show that a CPU-only configuration achieves 17 tokens per second, surpassing the 12.8 tokens per second obtained with GPU acceleration.
We analyze the factors driving this counterintuitive result, revealing that GPU memory transfer overhead and CPU thread optimization play a critical role.
Our findings challenge conventional GPU-first thinking, highlighting the untapped potential of optimized CPU inference.
arXiv Detail & Related papers (2025-05-09T23:05:53Z)
- TileLang: A Composable Tiled Programming Model for AI Systems [17.240134151647187]
We present TileLang, a generalized tiled programming model for more efficient AI programming.
TileLang decouples the scheduling space (thread binding, layout, tensorization, and pipelining) from dataflow and encapsulates it as a set of customization annotations and primitives.
Comprehensive experiments on commonly used devices show that TileLang achieves state-of-the-art performance on key kernels.
arXiv Detail & Related papers (2025-04-24T14:08:49Z)
- ML-Triton, A Multi-Level Compilation and Language Extension to Triton GPU Programming [2.4665562732779773]
Triton is a DSL that offers a more user-friendly and portable alternative to low-level GPU programming by working at a higher level of abstraction.
We propose ML-Triton, which features a multi-level compilation flow and programming interface.
Our approach achieves over 95% of the performance of expert-written kernels on Intel GPUs (a minimal Triton kernel is sketched after this entry for context).
arXiv Detail & Related papers (2025-03-19T08:31:39Z)
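For context on the ML-Triton entry above, the sketch below is a standard minimal Triton vector-add kernel showing the higher-level, block-oriented programming style the entry refers to. It assumes the triton and torch packages and a supported GPU, and it is not taken from the ML-Triton paper.

```python
# A standard minimal Triton kernel (vector add), illustrating the higher-level,
# block-oriented style the ML-Triton entry refers to. Requires the `triton`
# package and a supported GPU; not taken from the paper itself.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                        # one program per block of elements
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                        # guard the ragged tail
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)                     # launch enough blocks to cover n
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

if __name__ == "__main__":
    x = torch.randn(4096, device="cuda")
    y = torch.randn(4096, device="cuda")
    assert torch.allclose(add(x, y), x + y)
```

Each program instance handles one block of elements rather than individual threads; the mask guards the tail when the length is not a multiple of the block size.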
- KernelBench: Can LLMs Write Efficient GPU Kernels? [36.4117525096377]
KernelBench is an open-source framework for evaluating language models' ability to write fast and correct kernels.
We introduce a new evaluation metric, fast_p, which measures the fraction of generated kernels that are both functionally correct and faster than the baseline by at least a threshold p (a toy computation is sketched after this entry).
Our experiments show that frontier reasoning models perform the best out of the box but still fall short overall.
arXiv Detail & Related papers (2025-02-14T19:30:53Z)
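As a reading aid for the fast_p metric in the KernelBench entry above, the toy helper below counts a kernel as a hit when it is functionally correct and at least p times faster than the baseline. The exact definition is in the KernelBench paper; the class and function names here are hypothetical.

```python
# Toy illustration of a fast_p-style metric (see the KernelBench entry above):
# the fraction of generated kernels that are functionally correct AND achieve
# a speedup of at least `p` over the baseline. Illustrative only; consult the
# benchmark itself for the exact definition.
from dataclasses import dataclass
from typing import List

@dataclass
class KernelResult:
    correct: bool
    speedup_over_baseline: float  # baseline_time / kernel_time

def fast_p(results: List[KernelResult], p: float) -> float:
    if not results:
        return 0.0
    hits = sum(1 for r in results if r.correct and r.speedup_over_baseline >= p)
    return hits / len(results)

if __name__ == "__main__":
    results = [
        KernelResult(correct=True, speedup_over_baseline=1.3),
        KernelResult(correct=True, speedup_over_baseline=0.8),
        KernelResult(correct=False, speedup_over_baseline=2.0),
    ]
    print(fast_p(results, p=1.0))  # 1/3: only the first kernel is correct and >= 1x
```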
- FlashAttention on a Napkin: A Diagrammatic Approach to Deep Learning IO-Awareness [0.0]
Methods like FlashAttention have achieved a 6x performance improvement over native PyTorch by avoiding unnecessary data transfers.
This paper extends Neural Circuit Diagrams for deep learning models to consider resource usage and the distribution of tasks across a GPU hierarchy.
We develop a methodology for representing intermediate-level pseudocode with diagrams, allowing hardware-aware algorithms to be derived step-by-step.
arXiv Detail & Related papers (2024-12-04T13:52:04Z)
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters [10.403248386029407]
Self-attention is a significant computational bottleneck due to its quadratic complexity in the sequence length.
In this work, we derive the scalar energy function whose gradient computes the self-attention block.
Our formulation reveals that the reduction across the sequence axis can be efficiently computed in parallel through a tree reduction (a brief note after this entry sketches why).
arXiv Detail & Related papers (2024-08-07T21:16:55Z)
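A brief note on the Tree Attention entry above: the softmax normalization in attention is a logsumexp over the sequence, and logsumexp combines associatively, which is what makes a parallel tree reduction possible. The display below also shows one standard energy-function construction whose gradient recovers the attention output; it is a sketch of the general idea, not necessarily the paper's exact formulation.

```latex
% Logsumexp over a sequence can be merged pairwise, enabling a tree reduction:
\mathrm{lse}(x_{1:N}) = \log\!\left( e^{\mathrm{lse}(x_{1:m})} + e^{\mathrm{lse}(x_{m+1:N})} \right)

% One standard construction: differentiate a logsumexp with a source term
% \zeta, evaluated at \zeta = 0, to recover the attention output for query q
% (a sketch of the idea, not necessarily the paper's exact energy function):
F(\zeta) = \log \sum_{i=1}^{N} \exp\!\left( q^{\top} k_i + \zeta^{\top} v_i \right),
\qquad
\left. \nabla_{\zeta} F \right|_{\zeta = 0}
  = \sum_{i=1}^{N} \frac{\exp\!\left( q^{\top} k_i \right)}{\sum_{j=1}^{N} \exp\!\left( q^{\top} k_j \right)} \, v_i .
```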
- Accelerating Machine Learning Primitives on Commodity Hardware [0.0]
We present an extensive study of the sliding window convolution technique as a more efficient alternative to the commonly used general matrix multiplication (GEMM) based convolution in deep neural networks (DNNs).
Our results suggest that the sliding window computation kernels can outperform GEMM-based convolution on a CPU and even on dedicated hardware accelerators (a toy correctness check of the two formulations follows this entry).
This could promote a wider adoption of AI on low-power and low-memory devices without the need for specialized hardware.
arXiv Detail & Related papers (2023-10-08T16:26:18Z)
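To make the contrast in the entry above concrete, the NumPy sketch below implements the same 2D convolution two ways: a direct sliding-window loop and the im2col-plus-GEMM formulation. It only checks that the two agree numerically and says nothing about the paper's performance results; function names are illustrative.

```python
# Minimal NumPy sketch contrasting direct sliding-window convolution with the
# im2col + GEMM formulation (see the entry above). This only demonstrates that
# the two formulations compute the same result; it makes no performance claim.
import numpy as np

def conv2d_sliding(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Direct sliding-window convolution. x: (H, W), w: (kH, kW), valid padding."""
    H, W = x.shape
    kH, kW = w.shape
    out = np.zeros((H - kH + 1, W - kW + 1), dtype=x.dtype)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+kH, j:j+kW] * w)
    return out

def conv2d_im2col(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Same convolution expressed as im2col followed by a GEMM."""
    H, W = x.shape
    kH, kW = w.shape
    oH, oW = H - kH + 1, W - kW + 1
    cols = np.empty((oH * oW, kH * kW), dtype=x.dtype)
    for i in range(oH):
        for j in range(oW):
            cols[i * oW + j] = x[i:i+kH, j:j+kW].ravel()  # unroll each window
    return (cols @ w.ravel()).reshape(oH, oW)             # the GEMM step

if __name__ == "__main__":
    x = np.random.randn(8, 8).astype(np.float32)
    w = np.random.randn(3, 3).astype(np.float32)
    assert np.allclose(conv2d_sliding(x, w), conv2d_im2col(x, w), atol=1e-5)
```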
- Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
- Batch-efficient EigenDecomposition for Small and Medium Matrices [65.67315418971688]
EigenDecomposition (ED) is at the heart of many computer vision algorithms and applications.
We propose a QR-based ED method tailored to the application scenarios of computer vision (a plain QR-iteration refresher follows this entry).
arXiv Detail & Related papers (2022-07-09T09:14:12Z)
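As background for the QR-based eigendecomposition entry above, the sketch below is the textbook unshifted QR iteration for a symmetric matrix. The paper's batched, vision-oriented method is considerably more refined; this is only a refresher on the underlying idea, and the function name is illustrative.

```python
# Textbook (unshifted) QR iteration for a symmetric matrix, as background for
# the QR-based eigendecomposition entry above. The paper's batched method is
# more sophisticated; this is only a refresher on the underlying idea.
import numpy as np

def qr_eigvals_sym(A: np.ndarray, iters: int = 200) -> np.ndarray:
    """Approximate eigenvalues of a symmetric matrix via repeated A <- R @ Q."""
    Ak = A.copy()
    for _ in range(iters):
        Q, R = np.linalg.qr(Ak)   # orthogonal-triangular factorization
        Ak = R @ Q                # similarity transform: same eigenvalues
    return np.sort(np.diag(Ak))   # Ak converges toward diagonal form

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))
    true_eigs = np.arange(1.0, 7.0)              # well-separated spectrum
    A = Q @ np.diag(true_eigs) @ Q.T             # symmetric test matrix
    assert np.allclose(qr_eigvals_sym(A), true_eigs, atol=1e-6)
```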
- VersaGNN: a Versatile accelerator for Graph neural networks [81.1667080640009]
We propose VersaGNN, an ultra-efficient, systolic-array-based versatile hardware accelerator.
VersaGNN achieves on average a 3712$\times$ speedup with 1301.25$\times$ energy reduction on CPU, and a 35.4$\times$ speedup with 17.66$\times$ energy reduction on GPU.
arXiv Detail & Related papers (2021-05-04T04:10:48Z)
- Kernel methods through the roof: handling billions of points efficiently [94.31450736250918]
Kernel methods provide an elegant and principled approach to nonparametric learning, but so far could hardly be used in large scale problems.
Recent advances have shown the benefits of a number of algorithmic ideas, for example combining optimization, numerical linear algebra, and random projections (one such idea, the Nyström approximation, is sketched after this entry).
Here, we push these efforts further to develop and test a solver that takes full advantage of GPU hardware.
arXiv Detail & Related papers (2020-06-18T08:16:25Z)
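One concrete example of the "numerical linear algebra and random projections" ideas mentioned in the entry above is the Nyström low-rank kernel approximation, sketched below in NumPy. The paper's actual large-scale GPU solver goes well beyond this; the helper names are illustrative.

```python
# A small NumPy sketch of the Nystrom low-rank kernel approximation, one of the
# "numerical linear algebra + random projections" ideas the entry above alludes
# to. The paper's actual GPU solver goes well beyond this illustration.
import numpy as np

def rbf_kernel(X: np.ndarray, Y: np.ndarray, gamma: float = 0.5) -> np.ndarray:
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def nystrom_approx(X: np.ndarray, m: int, gamma: float = 0.5, seed: int = 0):
    """Approximate K = k(X, X) using m randomly chosen landmark points."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=m, replace=False)
    K_nm = rbf_kernel(X, X[idx], gamma)        # n x m cross-kernel
    K_mm = rbf_kernel(X[idx], X[idx], gamma)   # m x m landmark kernel
    # K ~= K_nm @ pinv(K_mm) @ K_nm.T : a rank-m approximation of the full kernel
    return K_nm @ np.linalg.pinv(K_mm) @ K_nm.T

if __name__ == "__main__":
    X = np.random.default_rng(1).standard_normal((200, 3))
    K = rbf_kernel(X, X)
    K_hat = nystrom_approx(X, m=100)
    rel_err = np.linalg.norm(K - K_hat) / np.linalg.norm(K)
    print(f"relative Frobenius error with 100 landmarks: {rel_err:.3f}")
```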
- Heterogeneous CPU+GPU Stochastic Gradient Descent Algorithms [1.3249453757295084]
We study training algorithms for deep learning on heterogeneous CPU+GPU architectures.
Our two-fold objective -- maximize convergence rate and resource utilization simultaneously -- makes the problem challenging.
We show that the implementation of these algorithms achieves both faster convergence and higher resource utilization on several real datasets.
arXiv Detail & Related papers (2020-04-19T05:21:20Z)
- MPLP++: Fast, Parallel Dual Block-Coordinate Ascent for Dense Graphical Models [96.1052289276254]
This work introduces a new MAP-solver, based on the popular Dual Block-Coordinate Ascent principle.
Surprisingly, by making a small change to the low-performing MPLP solver, we derive the new solver MPLP++, which outperforms all existing solvers by a large margin.
arXiv Detail & Related papers (2020-04-16T16:20:53Z)