Accelerating a Triton Fused Kernel for W4A16 Quantized Inference with
SplitK work decomposition
- URL: http://arxiv.org/abs/2402.00025v2
- Date: Thu, 22 Feb 2024 20:38:47 GMT
- Title: Accelerating a Triton Fused Kernel for W4A16 Quantized Inference with
SplitK work decomposition
- Authors: Adnan Hoque, Less Wright, Chih-Chieh Yang, Mudhakar Srivatsa, Raghu
Ganti
- Abstract summary: We propose an implementation of an efficient fused matrix multiplication kernel for W4A16 quantized inference.
Our implementation shows speed improvements for the skinny matrix-matrix multiplications found in foundation model inference workloads.
- Score: 0.44998333629984877
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose an implementation of an efficient fused matrix multiplication
kernel for W4A16 quantized inference, where we perform dequantization and GEMM
in a fused kernel using a SplitK work decomposition. Our implementation shows
improvement for the type of skinny matrix-matrix multiplications found in
foundation model inference workloads. In particular, this paper surveys the
type of matrix multiplication between a skinny activation matrix and a square
weight matrix. Our results show an average of 65% speed improvement on A100,
and an average of 124% speed improvement on H100 (with a peak of 295%) for a
range of matrix dimensions including those found in a llama-style model, where
m < n = k.
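The paper's kernel is written in Triton; as a rough, framework-free illustration of the SplitK idea (not the authors' implementation), the following NumPy sketch splits the K (reduction) dimension across work-groups, dequantizes each 4-bit weight slice on the fly, and sums partial products, standing in for the atomic adds a GPU kernel would issue. The function name, uint8 storage of the 4-bit weights, and per-output-channel scale/zero-point layout are illustrative assumptions.

```python
# Minimal NumPy sketch of SplitK fused dequant + GEMM (illustrative only).
import numpy as np

def splitk_w4a16_gemm(a, w_q, scale, zero, split_k=4):
    """a: (M, K) activations; w_q: (K, N) 4-bit weights stored as uint8 in
    [0, 15]; scale, zero: (N,) per-output-channel dequantization parameters.
    Returns C = A @ dequant(W), accumulated in fp32."""
    M, K = a.shape
    _, N = w_q.shape
    assert K % split_k == 0
    k_len = K // split_k
    c = np.zeros((M, N), dtype=np.float32)
    for s in range(split_k):                 # one iteration == one SplitK slice
        ks = slice(s * k_len, (s + 1) * k_len)
        w = (w_q[ks].astype(np.float32) - zero) * scale  # fused dequantization
        c += a[:, ks].astype(np.float32) @ w  # a GPU kernel would atomic_add here
    return c

# Toy check against dequantize-then-GEMM, with a skinny shape (m < n = k).
rng = np.random.default_rng(0)
M, K, N = 8, 512, 512
a = rng.standard_normal((M, K)).astype(np.float32)
w_q = rng.integers(0, 16, size=(K, N)).astype(np.uint8)
scale = rng.uniform(0.01, 0.1, N).astype(np.float32)
zero = np.full(N, 8.0, dtype=np.float32)
ref = a @ ((w_q.astype(np.float32) - zero) * scale)
assert np.allclose(splitk_w4a16_gemm(a, w_q, scale, zero), ref, atol=1e-3)
```

The point of SplitK for skinny shapes is occupancy: with small m there are few output tiles to parallelize over, so splitting the reduction dimension creates more independent work-groups at the cost of an atomic reduction.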
Related papers
- Orthogonal Finetuning Made Scalable [87.49040247077389]
Orthogonal finetuning (OFT) offers highly parameter-efficient adaptation while preventing catastrophic forgetting, but its high runtime and memory demands limit practical deployment.
We identify the core computational bottleneck in OFT as its weight-centric implementation, which relies on costly matrix-matrix multiplications with cubic complexity.
We propose OFTv2, an input-centric reformulation that instead uses matrix-vector multiplications (i.e., matrix-free computation), reducing the computational cost to quadratic.
These modifications allow OFTv2 to achieve up to 10x faster training and 3x lower GPU memory usage without compromising performance.
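The complexity claim is easy to see in code. The sketch below contrasts the weight-centric update (materialize R @ W, cubic in the width) with the input-centric reformulation (two matvec-shaped products per input, quadratic). The random orthogonal factor R is a stand-in: OFT actually parametrizes R (e.g., block-diagonally, via Cayley maps) rather than sampling it.

```python
import numpy as np

rng = np.random.default_rng(0)
d, batch = 512, 4
W = rng.standard_normal((d, d))                    # frozen pretrained weight
R, _ = np.linalg.qr(rng.standard_normal((d, d)))   # stand-in orthogonal factor
x = rng.standard_normal((d, batch))

y_weight_centric = (R @ W) @ x   # OFT-style: O(d^3) to form the rotated weight
y_input_centric = R @ (W @ x)    # OFTv2-style: O(d^2) per input, matrix-free

assert np.allclose(y_weight_centric, y_input_centric)
```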
arXiv Detail & Related papers (2025-06-24T17:59:49Z)
- A Nonlinear Hash-based Optimization Method for SpMV on GPUs [19.6395697341071]
We highlight the effectiveness of hash-based techniques in optimizing sparse matrix reordering.
In this paper, we introduce the Hash-based Partition (HBP) format, a lightweight SpMV approach.
In experiments, our method offers an average speedup of 3.53 times compared to the sorting approach and 3.67 times compared to the dynamic programming method employed in Regu2D.
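The summary does not spell out the HBP format, so the following is only a hedged sketch of the general idea behind hash-based reordering: hash each row's sparsity pattern into a cheap integer signature and sort by it, so similar rows land adjacent without an expensive pairwise-similarity sort. hash_row_signature and its bucketing scheme are invented for illustration.

```python
import numpy as np
from scipy.sparse import random as sprand

def hash_row_signature(col_idx, n_cols, n_buckets=32):
    # Bitmask of which coarse column buckets the row touches: a cheap,
    # order-insensitive hash of the row's nonzero pattern.
    sig = 0
    for j in col_idx:
        sig |= 1 << (j * n_buckets // n_cols)
    return sig

A = sprand(1024, 1024, density=0.01, format="csr", random_state=0)
sigs = [hash_row_signature(A.indices[A.indptr[i]:A.indptr[i + 1]], A.shape[1])
        for i in range(A.shape[0])]
order = np.argsort(sigs, kind="stable")  # group rows with similar patterns
A_grouped = A[order]                     # O(n log n) reorder, no pairwise sort
```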
arXiv Detail & Related papers (2025-04-11T08:31:44Z)
- SMM-Conv: Scalar Matrix Multiplication with Zero Packing for Accelerated Convolution [4.14360329494344]
We present a novel approach for accelerating convolutions during inference for CPU-based architectures.
Our experiments with commonly used network architectures demonstrate a significant speedup compared to existing indirect methods.
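The summary gives no algorithmic detail; going by the title, the kernel replaces im2col-style packing with scalar-times-matrix accumulations. A hedged single-channel sketch of that pattern (deep-learning-style correlation, no kernel flip):

```python
import numpy as np

def conv2d_smm(x, k):
    """Valid 2D convolution as a sum of scalar * shifted-window products:
    one scalar-matrix multiply-accumulate per kernel tap, no im2col buffer."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow), dtype=x.dtype)
    for i in range(kh):
        for j in range(kw):
            out += k[i, j] * x[i:i + oh, j:j + ow]
    return out

rng = np.random.default_rng(0)
x, k = rng.standard_normal((32, 32)), rng.standard_normal((3, 3))
ref = np.array([[np.sum(x[r:r + 3, c:c + 3] * k) for c in range(30)]
                for r in range(30)])                    # naive sliding window
assert np.allclose(conv2d_smm(x, k), ref)
```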
arXiv Detail & Related papers (2024-11-23T21:43:38Z)
- Compute Better Spent: Replacing Dense Layers with Structured Matrices [77.61728033234233]
We identify more efficient alternatives to dense matrices, as exemplified by the success of convolutional networks in the image domain.
We show that different structures often require drastically different initialization scales and learning rates, which are crucial to performance.
We propose a novel matrix family containing Monarch matrices, the Block-Train, which we show performs better than dense for the same compute on multiple tasks.
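As a concrete instance of trading a dense layer for structure, here is a hedged sketch of a Monarch-style factor: two block-diagonal matrices separated by a fixed permutation give an n x n map in O(n^1.5) compute and parameters. This is a simplification for illustration; the paper's Block-Train family is a different (related) construction.

```python
import numpy as np

def monarch_like_matvec(L, R, x):
    """Apply (blockdiag(R) . perm . blockdiag(L)) to x, where n = b*b.
    Cost is 2*b^3 = 2*n^1.5 multiply-adds vs n^2 for a dense matrix."""
    b = L.shape[0]                      # L, R: (b, b, b) = b blocks of b x b
    z = x.reshape(b, b)
    z = np.einsum("bij,bj->bi", L, z)   # block-diagonal L, block k on segment k
    z = z.T                             # fixed permutation between the factors
    z = np.einsum("bij,bj->bi", R, z)   # block-diagonal R
    return z.reshape(-1)

b = 16                                  # n = 256: 2*b^3 = 8192 parameters
rng = np.random.default_rng(0)
L = rng.standard_normal((b, b, b))
R = rng.standard_normal((b, b, b))
x = rng.standard_normal(b * b)
y = monarch_like_matvec(L, R, x)        # vs n^2 = 65536 params for dense
```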
arXiv Detail & Related papers (2024-06-10T13:25:43Z)
- Fast inference with Kronecker-sparse matrices [4.387337528923525]
We present the first energy and time benchmarks for the multiplication with Kronecker-sparse matrices.
Our benchmark also reveals that specialized implementations spend up to 50% of their total runtime on memory rewriting operations.
We implement a new kernel that achieves a median speed-up of x1.4, while also cutting energy consumption by 15%.
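Such speedups rest on never materializing the Kronecker-structured matrix. A minimal sketch of the underlying identity for the dense case (the paper's matrices are Kronecker-sparse, with sparsity patterns given by Kronecker products, and its kernel additionally attacks the memory-rewriting cost mentioned above):

```python
import numpy as np

rng = np.random.default_rng(0)
p, m, q, n = 32, 32, 16, 16
A = rng.standard_normal((p, m))
B = rng.standard_normal((q, n))
X = rng.standard_normal((m, n))     # x = vec(X), row-major flattening

# Naive: materialize the (p*q) x (m*n) Kronecker product for each matvec.
y_naive = np.kron(A, B) @ X.reshape(-1)

# Factored: (A kron B) vec(X) = vec(A X B^T), never forming the big matrix.
y_fast = (A @ X @ B.T).reshape(-1)

assert np.allclose(y_naive, y_fast)
```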
arXiv Detail & Related papers (2024-05-23T19:36:10Z)
- LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning [66.85589263870702]
Our approach uses an iterative algorithm to decompose each pretrained matrix into a high-precision low-rank component and a memory-efficient quantized component.
Experiments on finetuning RoBERTa and LLaMA-2 demonstrate that our low-rank plus quantized matrix decomposition approach (LQ-LoRA) outperforms strong QLoRA and GPTQ-LoRA baselines.
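A hedged sketch of the iterative decomposition: alternately quantize the residual W - L and refit the rank-r component of W - Q by truncated SVD. The uniform quantizer below is a toy stand-in; LQ-LoRA's actual quantization scheme is more sophisticated (data-aware, mixed configurations).

```python
import numpy as np

def uniform_quantize(X, n_bits):
    # Toy min-max uniform quantizer (stand-in for LQ-LoRA's scheme).
    lo, hi = X.min(), X.max()
    step = (hi - lo) / (2 ** n_bits - 1)
    return lo + np.round((X - lo) / step) * step

def lq_decompose(W, rank=8, n_bits=4, iters=10):
    """Alternating minimization so that W ~= Q + L1 @ L2."""
    L = np.zeros_like(W)
    for _ in range(iters):
        Q = uniform_quantize(W - L, n_bits)        # fix L, quantize residual
        U, s, Vt = np.linalg.svd(W - Q, full_matrices=False)
        L = (U[:, :rank] * s[:rank]) @ Vt[:rank]   # fix Q, best rank-r fit
    L1 = U[:, :rank] * np.sqrt(s[:rank])           # split L into two factors
    L2 = np.sqrt(s[:rank])[:, None] * Vt[:rank]
    return Q, L1, L2

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))
Q, L1, L2 = lq_decompose(W)
rel_err = np.linalg.norm(W - Q - L1 @ L2) / np.linalg.norm(W)
```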
arXiv Detail & Related papers (2023-11-20T18:57:41Z)
- An Efficient Algorithm for Clustered Multi-Task Compressive Sensing [60.70532293880842]
Clustered multi-task compressive sensing is a hierarchical model that solves multiple compressive sensing tasks.
The existing inference algorithm for this model is computationally expensive and does not scale well in high dimensions.
We propose a new algorithm that substantially accelerates model inference by avoiding the need to explicitly compute these covariance matrices.
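The summary only states that the covariance matrices are never formed explicitly; the general trick can be shown with a matrix-free solver. A sketch using SciPy's LinearOperator with conjugate gradients on a generic ridge-type system, not the paper's actual hierarchical model:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(0)
n_meas, dim, alpha = 200, 1000, 0.1
Phi = rng.standard_normal((n_meas, dim))
y = rng.standard_normal(n_meas)

# Matrix-free operator for (Phi^T Phi + alpha * I): two thin matvecs per
# application, never forming the dim x dim covariance-like matrix.
op = LinearOperator((dim, dim), matvec=lambda v: Phi.T @ (Phi @ v) + alpha * v)
x_hat, info = cg(op, Phi.T @ y)   # info == 0 on convergence
```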
arXiv Detail & Related papers (2023-09-30T15:57:14Z)
- Batch-efficient EigenDecomposition for Small and Medium Matrices [65.67315418971688]
EigenDecomposition (ED) is at the heart of many computer vision algorithms and applications.
We propose a QR-based ED method dedicated to the application scenarios of computer vision.
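The summary does not include the algorithm, but a bare-bones, batch-friendly QR iteration shows why QR-based ED suits many small matrices: every step is one batched QR plus batched matmuls. This sketch omits the shifts and deflation a practical method (presumably including the paper's) needs:

```python
import torch

def batched_qr_eig(A, n_iters=300):
    """Unshifted QR iteration, batched over the leading dim: for symmetric A,
    A <- R @ Q is a similarity transform whose iterates converge toward a
    diagonal matrix of eigenvalues."""
    V = torch.eye(A.shape[-1], dtype=A.dtype).expand_as(A).clone()
    for _ in range(n_iters):
        Q, R = torch.linalg.qr(A)       # one batched QR per step
        A = R @ Q                       # same spectrum as before
        V = V @ Q                       # accumulated eigenvector estimate
    return torch.diagonal(A, dim1=-2, dim2=-1), V

B, n = 64, 8                            # many small matrices, one batch
X = torch.randn(B, n, n, dtype=torch.float64)
S = X @ X.transpose(-1, -2)             # symmetric positive-definite batch
evals, evecs = batched_qr_eig(S)
ref = torch.linalg.eigvalsh(S)          # library reference, ascending order
err = (evals.sort(dim=-1).values - ref).abs().max()  # -> 0 as n_iters grows
```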
arXiv Detail & Related papers (2022-07-09T09:14:12Z)
- Learning in High-Dimensional Feature Spaces Using ANOVA-Based Fast Matrix-Vector Multiplication [0.0]
The kernel matrix is typically dense and large-scale. Depending on the dimension of the feature space, even computing all of its entries in reasonable time becomes a challenging task.
We propose the use of an ANOVA kernel, where we construct several kernels based on lower-dimensional feature spaces for which we provide fast algorithms realizing the matrix-vector products.
Based on a feature grouping approach, we then show how the fast matrix-vector products can be embedded into a learning method choosing kernel ridge regression and the preconditioned conjugate gradient solver.
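A hedged sketch of the pipeline this summary describes: a kernel that is a sum of Gaussian kernels on low-dimensional feature groups, plugged as a black-box matvec into conjugate gradients for kernel ridge regression. The feature groups are hypothetical, each group kernel is applied densely here (the paper realizes these matvecs with fast NFFT-type transforms), and plain unpreconditioned CG is used for brevity.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(0)
n, d = 500, 9
X = rng.standard_normal((n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

groups = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]  # hypothetical feature grouping

def kernel_matvec(v):
    # K v where K is a sum of Gaussian kernels on low-dimensional groups.
    out = np.zeros_like(v)
    for g in groups:
        D = ((X[:, g][:, None, :] - X[:, g][None, :, :]) ** 2).sum(-1)
        out += np.exp(-D) @ v
    return out

beta = 1e-2                                  # ridge regularization
op = LinearOperator((n, n), matvec=lambda v: kernel_matvec(v) + beta * v)
alpha, info = cg(op, y)                      # kernel ridge regression via CG
```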
arXiv Detail & Related papers (2021-11-19T10:29:39Z)
- Robust 1-bit Compressive Sensing with Partial Gaussian Circulant Matrices and Generative Priors [54.936314353063494]
We provide recovery guarantees for a correlation-based optimization algorithm for robust 1-bit compressive sensing.
We make use of a practical iterative algorithm, and perform numerical experiments on image datasets to corroborate our results.
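A minimal sketch of a correlation-based estimate for 1-bit measurements: with y = sign(Ax), the backprojection A^T y correlates with the signal direction, and a structural constraint recovers it. Here the constraint is toy hard-thresholding to s entries and A is dense Gaussian; the paper uses a generative prior and partial Gaussian circulant matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, s = 1000, 400, 10
x = np.zeros(n)
x[rng.choice(n, s, replace=False)] = rng.standard_normal(s)
x /= np.linalg.norm(x)              # only the direction is recoverable

A = rng.standard_normal((m, n))     # dense Gaussian stand-in
y = np.sign(A @ x)                  # 1-bit measurements

corr = A.T @ y / m                  # backprojection correlates with x
idx = np.argsort(-np.abs(corr))[:s] # toy structural step: keep s largest
x_hat = np.zeros(n)
x_hat[idx] = corr[idx]
x_hat /= np.linalg.norm(x_hat)
cosine = float(x @ x_hat)           # approaches 1 as m grows
```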
arXiv Detail & Related papers (2021-08-08T05:28:06Z)
- Doping: A technique for efficient compression of LSTM models using sparse structured additive matrices [14.321761305835972]
We propose the notion of doping -- addition of an extremely sparse matrix to a structured matrix.
Doping facilitates additional degrees of freedom for a small number of parameters, allowing them to independently diverge from the fixed structure.
We show that the doped KP compression technique outperforms previous state-of-the-art compression results, achieving a 1.3x to 2.4x higher compression factor at similar accuracy.
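In code, a doped layer is simply a structured matrix plus an extremely sparse additive term; both parts admit fast multiplication, so the extra degrees of freedom cost little at inference. A sketch with a Kronecker-product (KP) structured part, reusing the vec identity from the Kronecker-sparse entry above:

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
p = q = 32                                  # doped layer of size 1024 x 1024
A = rng.standard_normal((p, p))             # Kronecker factors (structured part)
B = rng.standard_normal((q, q))
S = sparse.random(p * q, p * q, density=0.001, format="csr",
                  random_state=0)           # extremely sparse additive "dope"

def doped_matvec(x):
    """(kron(A, B) + S) @ x without materializing kron(A, B): the structured
    part uses (A kron B) vec(X) = vec(A X B^T)."""
    structured = (A @ x.reshape(p, q) @ B.T).reshape(-1)
    return structured + S @ x

x = rng.standard_normal(p * q)
ref = (np.kron(A, B) + S.toarray()) @ x     # dense check (fine at this size)
assert np.allclose(doped_matvec(x), ref)
```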
arXiv Detail & Related papers (2021-02-14T05:14:09Z)
- Direct Spatial Implementation of Sparse Matrix Multipliers for Reservoir Computing [0.0]
Reservoir computing systems rely on the recurrent multiplication of a very large, sparse, fixed matrix.
We argue that direct implementation of these fixed matrices minimizes the work performed in the computation.
We present the structure of our bit-serial matrix multiplier, and evaluate using canonical signed digit representation to further reduce logic utilization.
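Canonical signed digit (CSD) recoding rewrites a constant with digits in {-1, 0, +1} and no two adjacent nonzeros, minimizing the add/subtract terms a fixed bit-serial multiplier needs. A minimal sketch of the standard non-adjacent-form recoding:

```python
def to_csd(n):
    """Non-adjacent form of a non-negative integer: digits in {-1, 0, +1},
    least-significant first, with no two adjacent nonzero digits."""
    digits = []
    while n > 0:
        if n & 1:
            d = 2 - (n & 3)   # +1 if n % 4 == 1, -1 if n % 4 == 3
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits

# 119 = 0b1110111 has six set bits, but only three signed digits:
# 119 = 128 - 8 - 1, i.e., three shift-and-add/subtract terms.
csd = to_csd(119)
assert sum(d * (1 << i) for i, d in enumerate(csd)) == 119
assert sum(d != 0 for d in csd) == 3
```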
arXiv Detail & Related papers (2021-01-21T23:16:22Z)
- SimpleMKKM: Simple Multiple Kernel K-means [49.500663154085586]
We propose a simple yet effective multiple kernel clustering algorithm, termed simple multiple kernel k-means (SimpleMKKM).
Our criterion is given by an intractable minimization-maximization problem in the kernel coefficient and clustering partition matrix.
We theoretically analyze the performance of SimpleMKKM in terms of its clustering generalization error.
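A hedged sketch of the min-max structure via the standard spectral relaxation: for fixed kernel weights gamma, the inner maximization over an orthonormal relaxed partition H has the closed form "sum of the top-k eigenvalues of K_gamma = sum_p gamma_p^2 K_p", and by Danskin's theorem the outer minimization over gamma can run projected gradient descent on the simplex. This relaxation is for illustration and is not the paper's exact reduced-gradient algorithm.

```python
import numpy as np

def simplex_project(v):
    # Euclidean projection onto the probability simplex (Duchi et al., 2008).
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1
    rho = np.nonzero(u - css / np.arange(1, len(v) + 1) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0)

def simple_mkkm_sketch(kernels, k, lr=0.05, iters=100):
    gamma = np.full(len(kernels), 1.0 / len(kernels))
    for _ in range(iters):
        Kg = sum(g * g * K for g, K in zip(gamma, kernels))
        w, V = np.linalg.eigh(Kg)
        H = V[:, -k:]                        # inner max: top-k eigenvectors
        # Danskin: gradient of the inner max wrt gamma_p at the maximizer H.
        grad = np.array([2 * g * np.trace(H.T @ K @ H)
                         for g, K in zip(gamma, kernels)])
        gamma = simplex_project(gamma - lr * grad)
    return gamma, H

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 5))
D = ((X[:, None] - X[None]) ** 2).sum(-1)
kernels = [np.exp(-s * D) for s in (0.5, 1.0, 2.0)]   # candidate kernels
gamma, H = simple_mkkm_sketch(kernels, k=3)           # learned kernel weights
```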
arXiv Detail & Related papers (2020-05-11T10:06:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.