Liger Kernel: Efficient Triton Kernels for LLM Training
- URL: http://arxiv.org/abs/2410.10989v3
- Date: Fri, 24 Jan 2025 00:14:55 GMT
- Title: Liger Kernel: Efficient Triton Kernels for LLM Training
- Authors: Pin-Lun Hsu, Yun Dai, Vignesh Kothapalli, Qingquan Song, Shao Tang, Siyu Zhu, Steven Shimizu, Shivam Sahni, Haowen Ning, Yanning Chen
- Abstract summary: Training Large Language Models (LLMs) efficiently at scale presents a formidable challenge, driven by their ever-increasing computational demands.
We introduce Liger-Kernel, an open-sourced set of Triton kernels developed specifically for LLM training.
With kernel optimization techniques like kernel operation fusing and input chunking, our kernels achieve on average a 20% increase in training throughput and a 60% reduction in GPU memory usage.
- Score: 6.373771349397682
- License:
- Abstract: Training Large Language Models (LLMs) efficiently at scale presents a formidable challenge, driven by their ever-increasing computational demands and the need for enhanced performance. In this work, we introduce Liger-Kernel, an open-sourced set of Triton kernels developed specifically for LLM training. With kernel optimization techniques like kernel operation fusing and input chunking, our kernels achieve on average a 20% increase in training throughput and a 60% reduction in GPU memory usage for popular LLMs compared to HuggingFace implementations. In addition, Liger-Kernel is designed with modularity, accessibility, and adaptability in mind, catering to both casual and expert users. Comprehensive benchmarks and integration tests are built in to ensure compatibility, performance, correctness, and convergence across diverse computing environments and model architectures. The source code is available under a permissive license at: github.com/linkedin/Liger-Kernel.
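To make the fusion idea concrete, below is a minimal Triton sketch of a fused RMSNorm forward pass. It is an illustration under simple assumptions (contiguous 2-D fp32 input, one program per row) with made-up function names, not Liger-Kernel's actual implementation: the normalization, scaling, and weight multiply happen in a single kernel launch, so each element is read and written once instead of round-tripping through global memory between separate elementwise ops.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def fused_rmsnorm_fwd(X, W, Y, stride, n_cols, eps, BLOCK_SIZE: tl.constexpr):
    # One program instance normalizes one row; the reduction, rsqrt, and
    # weight multiply are all fused into this single kernel.
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    x = tl.load(X + row * stride + cols, mask=mask, other=0.0).to(tl.float32)
    ms = tl.sum(x * x, axis=0) / n_cols            # mean of squares
    x_hat = x / tl.sqrt(ms + eps)                  # normalize
    w = tl.load(W + cols, mask=mask, other=0.0).to(tl.float32)
    tl.store(Y + row * stride + cols, x_hat * w, mask=mask)  # scale and write once


def rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Assumes a contiguous 2-D fp32 input; one kernel launch covers all rows.
    y = torch.empty_like(x)
    n_rows, n_cols = x.shape
    block = triton.next_power_of_2(n_cols)
    fused_rmsnorm_fwd[(n_rows,)](x, weight, y, x.stride(0), n_cols, eps, BLOCK_SIZE=block)
    return y
```

Input chunking complements fusion: memory-heavy steps such as the final-layer loss can be computed chunk by chunk so the full intermediate tensor never has to be materialized at once.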
Related papers
- KernelBench: Can LLMs Write Efficient GPU Kernels? [36.4117525096377]
KernelBench is an open-source framework for evaluating language models' ability to write fast and correct kernels.
We introduce a new evaluation metric, fast_p, which measures the percentage of generated kernels that are functionally correct and offer a speedup greater than an adjustable threshold p over the baseline.
Our experiments show that frontier reasoning models perform the best out of the box but still fall short overall.
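As a rough sketch of such a metric (the result schema below is illustrative, not KernelBench's actual API), fast_p counts only kernels that both pass the correctness check and exceed the speedup threshold p:

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class KernelResult:
    correct: bool    # passed the functional-correctness check against the reference
    speedup: float   # baseline_time / generated_kernel_time


def fast_p(results: Iterable[KernelResult], p: float) -> float:
    """Fraction of generated kernels that are correct and more than p-times faster."""
    results = list(results)
    if not results:
        return 0.0
    hits = sum(1 for r in results if r.correct and r.speedup > p)
    return hits / len(results)


# Example: fast_0 counts every correct kernel; fast_1 also requires beating the baseline.
print(fast_p([KernelResult(True, 1.3), KernelResult(True, 0.8), KernelResult(False, 2.0)], p=1.0))
```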
arXiv Detail & Related papers (2025-02-14T19:30:53Z) - Democratizing AI: Open-source Scalable LLM Training on GPU-based Supercomputers [65.35142508909892]
We present a novel four-dimensional hybrid parallel algorithm implemented in a highly scalable, portable, open-source framework called AxoNN.
We demonstrate fine-tuning of a 405-billion parameter LLM using AxoNN on Frontier.
arXiv Detail & Related papers (2025-02-12T06:05:52Z) - An Efficient Sparse Kernel Generator for O(3)-Equivariant Deep Networks [0.5737287537823071]
Rotation equivariant graph neural networks yield state-of-the-art performance on spatial deep learning tasks.
Key to these models is the Clebsch-Gordan (CG) tensor product, a kernel that contracts two dense feature vectors with a highly structured sparse tensor to produce a dense output vector.
We introduce a GPU sparse kernel generator for the CG tensor product that provides significant speedup over the best existing open and closed-source implementations.
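For intuition, the contraction can be written as a short dense reference in Python; this is only a toy illustration of the operation the paper optimizes, with the sparse coefficient tensor stored as a list of (i, j, k, value) entries:

```python
import numpy as np


def sparse_tensor_product(x: np.ndarray, y: np.ndarray, entries, out_dim: int) -> np.ndarray:
    """out[k] = sum over stored nonzeros of C[i, j, k] * x[i] * y[j]."""
    out = np.zeros(out_dim)
    for i, j, k, c in entries:   # only the nonzero coefficients contribute
        out[k] += c * x[i] * y[j]
    return out


# Toy example with a handful of nonzero coefficients.
entries = [(0, 0, 0, 1.0), (0, 1, 1, 0.5), (1, 0, 1, 0.5), (1, 1, 2, -1.0)]
print(sparse_tensor_product(np.array([1.0, 2.0]), np.array([3.0, 4.0]), entries, out_dim=3))
```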
arXiv Detail & Related papers (2025-01-23T08:20:47Z) - Highly Optimized Kernels and Fine-Grained Codebooks for LLM Inference on Arm CPUs [0.8217552831952]
Large language models (LLMs) have transformed the way we think about language understanding and generation.
Group quantization formats commonly used for LLM quantization have significant compute overheads and a resource-intensive dequantization process.
We present a groupwise non-uniform codebook-based quantization method for ultra-low-precision quantization of LLMs to better match non-uniform patterns in their weight distributions.
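A minimal sketch of the general idea (not the paper's specific format or kernels): each group of weights keeps a small, non-uniformly spaced codebook, the weights are stored as indices into it, and dequantization becomes a table lookup:

```python
import numpy as np


def quantize_group(weights: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Return, for each weight, the index of the nearest codebook entry."""
    # |group| x |codebook| distance matrix; argmin picks the closest code per weight.
    return np.abs(weights[:, None] - codebook[None, :]).argmin(axis=1)


def dequantize_group(indices: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    return codebook[indices]  # dequantization is a table lookup


group = np.array([0.12, -0.4, 0.05, 0.9])     # one group of raw weights
codebook = np.array([-0.5, 0.0, 0.1, 1.0])    # non-uniformly spaced 2-bit codebook
idx = quantize_group(group, codebook)
print(idx, dequantize_group(idx, codebook))   # stored indices and reconstructed weights
```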
arXiv Detail & Related papers (2024-12-23T03:44:29Z) - TorchTitan: One-stop PyTorch native solution for production ready LLM pre-training [17.157552816494427]
This paper introduces TorchTitan, an open-source, PyTorch-native distributed training system.
It unifies state-of-the-art techniques, streamlining integration and reducing overhead.
We evaluate TorchTitan on the Llama 3.1 family of large language models (LLMs)
arXiv Detail & Related papers (2024-10-09T03:26:11Z) - Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z) - CoLLiE: Collaborative Training of Large Language Models in an Efficient Way [59.09824823710863]
CoLLiE is an efficient library that facilitates collaborative training of large language models.
With its modular design and comprehensive functionality, CoLLiE offers a balanced blend of efficiency, ease of use, and customization.
arXiv Detail & Related papers (2023-12-01T08:02:16Z) - FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system that unlocks the potential of vast untapped consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, and the variability and heterogeneity of peers and devices.
arXiv Detail & Related papers (2023-09-03T13:27:56Z) - Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs), and 2) expressing the logical loops around the TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
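A conceptual Python sketch of this two-step split, with hypothetical names and a plain blocked matmul standing in for a tuned TPP: the computational core is isolated in a small primitive, while the logical loops around it stay separate so they can be reordered or parallelized independently:

```python
import numpy as np


def gemm_primitive(a_block: np.ndarray, b_block: np.ndarray, c_block: np.ndarray) -> None:
    """Stand-in for a tuned microkernel: C_block += A_block @ B_block."""
    c_block += a_block @ b_block


def blocked_matmul(a, b, c, bm=32, bn=32, bk=32):
    # Declarative intent: "for each (i, j, p) tile, apply the primitive"; loop
    # order and parallelization would be chosen by the framework, not hard-coded.
    m, k = a.shape
    _, n = b.shape
    for i in range(0, m, bm):
        for j in range(0, n, bn):
            for p in range(0, k, bk):
                gemm_primitive(a[i:i+bm, p:p+bk], b[p:p+bk, j:j+bn], c[i:i+bm, j:j+bn])


a, b = np.random.rand(64, 64), np.random.rand(64, 64)
c = np.zeros((64, 64))
blocked_matmul(a, b, c)
print(np.allclose(c, a @ b))  # sanity check against the unblocked product
```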
arXiv Detail & Related papers (2023-04-25T05:04:44Z) - PolyScientist: Automatic Loop Transformations Combined with Microkernels for Optimization of Deep Learning Primitives [55.79741270235602]
We develop a hybrid approach to developing deep learning kernels.
We use advanced polyhedral technology to automatically tune the outer loops for performance.
arXiv Detail & Related papers (2020-02-06T08:02:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.