Masked Matrix Multiplication for Emergent Sparsity
- URL: http://arxiv.org/abs/2402.14118v1
- Date: Wed, 21 Feb 2024 20:36:08 GMT
- Title: Masked Matrix Multiplication for Emergent Sparsity
- Authors: Brian Wheatman, Meghana Madhyastha, and Randal Burns
- Abstract summary: Transformer models exhibit emergent sparsity in which computations perform selective sparse access to dense data.
We build a vectorized and parallel matrix-multiplication system A X B = C that eliminates unnecessary computations.
- Score: 1.4786952412297807
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Artificial intelligence workloads, especially transformer models, exhibit
emergent sparsity in which computations perform selective sparse access to
dense data. The workloads are inefficient on hardware designed for dense
computations and do not map well onto sparse data representations. We build a
vectorized and parallel matrix-multiplication system A X B = C that eliminates
unnecessary computations and avoids branches based on a runtime evaluation of
sparsity. We use a combination of dynamic code lookup to adapt to the specific
sparsity encoded in the B matrix and preprocessing of sparsity maps of the A
and B matrices to compute conditional branches once for the whole computation.
For a wide range of sparsity, from 60% to 95% zeros, our implementation
performs fewer instructions and increases performance when compared with Intel
MKL's dense or sparse matrix multiply routines. Benefits can be as large as 2
times speedup and 4 times fewer instructions.
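
As a rough, hypothetical illustration of the mask-preprocessing idea described in the abstract (not the paper's vectorized implementation, and omitting its dynamic code lookup), the C++ sketch below precomputes nonzero bitmaps of A's rows and B's columns once and then walks only positions that are nonzero in both operands, so the inner loop has no per-element branching on data values. Function names and the K ≤ 64 restriction are illustrative assumptions.

```cpp
// Minimal sketch: C = A * B with row-major dense storage, where per-row
// bitmasks of A and per-column bitmasks of B are precomputed once, so the
// inner loop visits only positions that are nonzero in both operands.
#include <cstdint>
#include <cstdio>
#include <vector>

// Build one 64-bit nonzero mask per row of a row-major matrix
// (cols <= 64 assumed; a real implementation would tile into 64-wide chunks).
static std::vector<std::uint64_t> row_masks(const std::vector<float>& M,
                                            int rows, int cols) {
    std::vector<std::uint64_t> masks(rows, 0);
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            if (M[r * cols + c] != 0.0f) masks[r] |= std::uint64_t{1} << c;
    return masks;
}

// Masked matmul: C[m x n] = A[m x k] * B[k x n], with k <= 64.
void masked_matmul(const std::vector<float>& A, const std::vector<float>& B,
                   std::vector<float>& C, int m, int k, int n) {
    // Preprocess sparsity maps once for the whole computation.
    std::vector<std::uint64_t> a_masks = row_masks(A, m, k);
    std::vector<std::uint64_t> b_masks(n, 0);   // nonzeros per column of B
    for (int kk = 0; kk < k; ++kk)
        for (int j = 0; j < n; ++j)
            if (B[kk * n + j] != 0.0f) b_masks[j] |= std::uint64_t{1} << kk;

    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            // Positions contributing to C[i][j]: nonzero in both operands.
            std::uint64_t live = a_masks[i] & b_masks[j];
            float acc = 0.0f;
            while (live) {                       // iterate set bits only
                int kk = __builtin_ctzll(live);  // lowest set bit (GCC/Clang)
                acc += A[i * k + kk] * B[kk * n + j];
                live &= live - 1;                // clear that bit
            }
            C[i * n + j] = acc;
        }
    }
}

int main() {
    // Tiny example: 2x3 A times 3x2 B; exact result is [[16, 0], [0, 15]].
    std::vector<float> A = {1, 0, 2,
                            0, 3, 0};
    std::vector<float> B = {4, 0,
                            0, 5,
                            6, 0};
    std::vector<float> C(4, 0.0f);
    masked_matmul(A, B, C, 2, 3, 2);
    std::printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);
    return 0;
}
```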
Related papers
- SMM-Conv: Scalar Matrix Multiplication with Zero Packing for Accelerated Convolution [4.14360329494344]
We present a novel approach for accelerating convolutions during inference for CPU-based architectures.
Our experiments with commonly used network architectures demonstrate a significant speedup compared to existing indirect methods.
arXiv Detail & Related papers (2024-11-23T21:43:38Z)
- Optimized Inference for 1.58-bit LLMs: A Time and Memory-Efficient Algorithm for Binary and Ternary Matrix Multiplication [8.779871128906787]
Large Language Models (LLMs) suffer from inference inefficiency even when relying on advanced computational infrastructure.
We propose algorithms to improve the inference time and memory efficiency of 1.58-bit LLMs with ternary weight matrices.
Our results confirm the advantages of the approach in both time and memory, with observed reductions in inference time of up to 29x and in memory usage of up to 6x (an illustrative ternary sketch appears after this list).
arXiv Detail & Related papers (2024-11-10T04:56:14Z)
- Compute Better Spent: Replacing Dense Layers with Structured Matrices [77.61728033234233]
We identify more efficient alternatives to dense matrices, as exemplified by the success of convolutional networks in the image domain.
We show that different structures often require drastically different initialization scales and learning rates, which are crucial to performance.
We propose a novel matrix family containing Monarch matrices, the Block Tensor-Train (BTT), which we show performs better than dense matrices for the same compute on multiple tasks.
arXiv Detail & Related papers (2024-06-10T13:25:43Z)
- AMULET: Adaptive Matrix-Multiplication-Like Tasks [6.094431019524036]
We extend an open-source compiler to recognize and optimize matrix multiplication-like tasks.
Our framework, called Amulet, uses both database-style and compiler optimization techniques.
Amulet typically performs within 15% of hand-tuned matrix multiplication libraries, while handling a much broader class of computations.
arXiv Detail & Related papers (2023-05-12T17:04:24Z)
- PopSparse: Accelerated block sparse matrix multiplication on IPU [0.5661403709207713]
We introduce PopSparse, a library that enables fast sparse operations on Graphcore IPUs.
We target two different types of sparsity: static, where the sparsity pattern is fixed at compile-time; and dynamic, where it can change each time the model is run.
Results indicate that the PopSparse implementations are faster than dense matrix multiplications on IPU at a range of sparsity levels.
arXiv Detail & Related papers (2023-03-29T20:00:19Z)
- RSC: Accelerating Graph Neural Networks Training via Randomized Sparse Computations [56.59168541623729]
Training graph neural networks (GNNs) is time consuming because sparse graph-based operations are hard to accelerate in hardware.
We explore trading off the computational precision to reduce the time complexity via sampling-based approximation.
We propose Randomized Sparse Computation, which for the first time demonstrates the potential of training GNNs with approximated operations (a generic sampling sketch appears after this list).
arXiv Detail & Related papers (2022-10-19T17:25:33Z)
- Batch-efficient EigenDecomposition for Small and Medium Matrices [65.67315418971688]
EigenDecomposition (ED) is at the heart of many computer vision algorithms and applications.
We propose a QR-based ED method dedicated to the application scenarios of computer vision.
arXiv Detail & Related papers (2022-07-09T09:14:12Z)
- Monarch: Expressive Structured Matrices for Efficient and Accurate Training [64.6871423399431]
Large neural networks excel in many domains, but they are expensive to train and fine-tune.
A popular approach to reduce their compute or memory requirements is to replace dense weight matrices with structured ones.
We propose a class of matrices (Monarch) that is hardware-efficient.
arXiv Detail & Related papers (2022-04-01T17:37:29Z)
- Efficient GPU implementation of randomized SVD and its applications [17.71779625877989]
Matrix decompositions are ubiquitous in machine learning, with applications in dimensionality reduction, data compression, and deep learning algorithms.
Typical solutions for matrix decompositions have a computational complexity that significantly increases their cost and runtime.
We leverage efficient processing operations that can be run in parallel on modern Graphical Processing Units (GPUs) to reduce the computational burden of computing matrix decompositions.
arXiv Detail & Related papers (2021-10-05T07:42:41Z)
- Non-PSD Matrix Sketching with Applications to Regression and Optimization [56.730993511802865]
We present dimensionality reduction methods for non-PSD and "square-root" matrices.
We show how these techniques can be used for multiple downstream tasks.
arXiv Detail & Related papers (2021-06-16T04:07:48Z)
- Kernel methods through the roof: handling billions of points efficiently [94.31450736250918]
Kernel methods provide an elegant and principled approach to nonparametric learning, but so far they could hardly be used in large-scale problems.
Recent advances have shown the benefits of a number of algorithmic ideas, for example combining optimization, numerical linear algebra and random projections.
Here, we push these efforts further to develop and test a solver that takes full advantage of GPU hardware.
arXiv Detail & Related papers (2020-06-18T08:16:25Z)
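
For the 1.58-bit LLM entry above, a minimal, generic sketch of why ternary weight matrices are cheap to apply: with weights restricted to {-1, 0, +1}, a matrix-vector product reduces to additions, subtractions, and skips, with no multiplications. This is a standard observation, not the paper's optimized algorithm; the int8 weight encoding and function names are assumptions.

```cpp
// Generic sketch: ternary matvec y = W * x with W entries in {-1, 0, +1}.
// Each weight either adds, subtracts, or skips an activation.
#include <cstdint>
#include <cstdio>
#include <vector>

std::vector<float> ternary_matvec(const std::vector<std::int8_t>& W,
                                  const std::vector<float>& x,
                                  int rows, int cols) {
    std::vector<float> y(rows, 0.0f);
    for (int i = 0; i < rows; ++i) {
        float acc = 0.0f;
        for (int j = 0; j < cols; ++j) {
            std::int8_t w = W[i * cols + j];
            if (w == 1)       acc += x[j];   // +1: add the activation
            else if (w == -1) acc -= x[j];   // -1: subtract it
            /* 0: skip entirely */
        }
        y[i] = acc;
    }
    return y;
}

int main() {
    std::vector<std::int8_t> W = { 1, 0, -1,
                                   0, 1,  1 };   // 2 x 3 ternary weights
    std::vector<float> x = { 2.0f, 3.0f, 5.0f };
    auto y = ternary_matvec(W, x, 2, 3);
    std::printf("%g %g\n", y[0], y[1]);          // prints -3 and 8
    return 0;
}
```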
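
For the RSC entry above, a generic sketch of sampling-based approximate matrix multiplication: sample a few inner-dimension indices with probability proportional to the product of the matched column/row norms and rescale for an unbiased estimate, trading accuracy for fewer multiply-adds. This is a standard Monte Carlo construction under illustrative assumptions, not the paper's algorithm for GNN training.

```cpp
// Generic sketch: approximate C ~= A * B by sampling s inner-dimension
// indices with importance weights ||A[:,t]|| * ||B[t,:]|| and rescaling.
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

// A is m x k (row-major), B is k x n (row-major); sample `s` indices.
std::vector<float> sampled_matmul(const std::vector<float>& A,
                                  const std::vector<float>& B,
                                  int m, int k, int n, int s,
                                  std::mt19937& rng) {
    // Importance weights: column norm of A times row norm of B per index.
    std::vector<double> w(k, 0.0);
    double total = 0.0;
    for (int t = 0; t < k; ++t) {
        double na = 0.0, nb = 0.0;
        for (int i = 0; i < m; ++i) na += double(A[i * k + t]) * A[i * k + t];
        for (int j = 0; j < n; ++j) nb += double(B[t * n + j]) * B[t * n + j];
        w[t] = std::sqrt(na) * std::sqrt(nb);
        total += w[t];
    }
    std::discrete_distribution<int> pick(w.begin(), w.end());

    std::vector<float> C(std::size_t(m) * n, 0.0f);
    for (int draw = 0; draw < s; ++draw) {
        int t = pick(rng);
        double p = w[t] / total;        // sampling probability of index t
        double scale = 1.0 / (s * p);   // rescaling keeps the estimate unbiased
        for (int i = 0; i < m; ++i)
            for (int j = 0; j < n; ++j)
                C[i * n + j] += float(scale * A[i * k + t] * B[t * n + j]);
    }
    return C;
}

int main() {
    std::mt19937 rng(0);
    std::vector<float> A = {1, 0, 2, 0, 3, 0};   // 2 x 3
    std::vector<float> B = {4, 0, 0, 5, 6, 0};   // 3 x 2
    auto C = sampled_matmul(A, B, 2, 3, 2, 8, rng);
    // Prints a randomized approximation of the exact result [[16, 0], [0, 15]].
    std::printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);
    return 0;
}
```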
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.