VENOM: A Vectorized N:M Format for Unleashing the Power of Sparse Tensor
Cores
- URL: http://arxiv.org/abs/2310.02065v1
- Date: Tue, 3 Oct 2023 14:08:26 GMT
- Title: VENOM: A Vectorized N:M Format for Unleashing the Power of Sparse Tensor
Cores
- Authors: Roberto L. Castro, Andrei Ivanov, Diego Andrade, Tal Ben-Nun, Basilio
B. Fraguela, Torsten Hoefler
- Abstract summary: We show that Spatha, a high-performance sparse-library for Deep Learning routines, achieves up to 37x speedup over cuBLAS.
We also demonstrate a second-order pruning technique that enables sparsification to high sparsity ratios with V:N:M and little to no loss in accuracy in modern transformers.
- Score: 19.28753465771938
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The increasing success and scaling of Deep Learning models demand higher
computational efficiency and power. Sparsification can lead to both smaller
models and higher compute efficiency, and accelerated hardware is becoming
available. However, exploiting it efficiently requires kernel implementations,
pruning algorithms, and storage formats that utilize the hardware support of
specialized sparse vector units. One example is NVIDIA's Sparse Tensor Cores
(SPTCs), which promise a 2x speedup. However, SPTCs only support the 2:4
format, limiting achievable sparsity ratios to 50%.
We present the V:N:M format, which enables the execution of arbitrary N:M
ratios on SPTCs. To efficiently exploit the resulting format, we propose
Spatha, a high-performance sparse-library for DL routines. We show that Spatha
achieves up to 37x speedup over cuBLAS. We also demonstrate a second-order
pruning technique that enables sparsification to high sparsity ratios with
V:N:M and little to no loss in accuracy in modern transformers.
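To make the N:M constraint concrete, the short NumPy sketch below performs magnitude-based N:M pruning: in every group of M consecutive weights along the last axis, only the N largest-magnitude values are kept and the rest are zeroed. It illustrates the generic pattern that SPTCs accelerate for 2:4 and that V:N:M generalizes; the helper prune_n_m is a hypothetical name used for illustration and is not part of Spatha or the paper's second-order pruning pipeline.

```python
import numpy as np

def prune_n_m(weights, n=2, m=4):
    """Keep the n largest-magnitude values in every group of m consecutive
    elements along the last axis, zeroing the rest (generic N:M sparsity).

    Illustrative only: this is not the paper's V:N:M layout or Spatha.
    """
    w = np.asarray(weights, dtype=np.float32)
    assert w.shape[-1] % m == 0, "last dimension must be divisible by m"

    groups = w.reshape(-1, m)                     # one row per group of m
    # indices of the (m - n) smallest magnitudes in each group
    drop = np.argsort(np.abs(groups), axis=1)[:, : m - n]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)  # zero out the smallest
    return (groups * mask).reshape(w.shape)

# Example: prune a random 4x8 matrix to 2:4 (50%) sparsity.
rng = np.random.default_rng(0)
w_sparse = prune_n_m(rng.standard_normal((4, 8)), n=2, m=4)
print((w_sparse == 0).mean())  # -> 0.5, i.e. half the entries are zero
```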
Related papers
- An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models [65.37846460916042]
We find that the attention computation over visual tokens is extremely inefficient in the deep layers of popular LVLMs.
We introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency.
arXiv Detail & Related papers (2024-03-11T14:35:32Z) - FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency
Trade-off in Language Model Inference [57.119047493787185]
In practice, our method reduces model size by 43.1% and brings a $1.25\sim1.56\times$ wall-clock speedup on different hardware with a negligible accuracy drop.
arXiv Detail & Related papers (2024-01-08T17:29:16Z) - CAIT: Triple-Win Compression towards High Accuracy, Fast Inference, and
Favorable Transferability For ViTs [79.54107547233625]
Vision Transformers (ViTs) have emerged as state-of-the-art models for various vision tasks.
We propose a joint compression method for ViTs that offers both high accuracy and fast inference speed.
Our proposed method can achieve state-of-the-art performance across various ViTs.
arXiv Detail & Related papers (2023-09-27T16:12:07Z) - Efficient Quantized Sparse Matrix Operations on Tensor Cores [21.963041375857117]
We propose Magicube, a high-performance sparse-matrix library for low-precision integers on Tensor Cores.
We show that Magicube achieves on average 1.44x (up to 2.37x) speedup over the vendor-optimized library for sparse kernels, and 1.43x speedup over the state-of-the-art with comparable accuracy for end-to-end Transformer inference.
arXiv Detail & Related papers (2022-09-14T23:52:13Z) - Monarch: Expressive Structured Matrices for Efficient and Accurate
Training [64.6871423399431]
Large neural networks excel in many domains, but they are expensive to train and fine-tune.
A popular approach to reduce their compute or memory requirements is to replace dense weight matrices with structured ones.
We propose a class of matrices (Monarch) that is hardware-efficient.
arXiv Detail & Related papers (2022-04-01T17:37:29Z) - Accelerating DNN Training with Structured Data Gradient Pruning [0.5801044612920815]
Weight pruning is a technique to make Deep Neural Network (DNN) inference more computationally efficient.
Modern accelerators such as the Nvidia A100 GPU support this type of structured sparsity, with 2 nonzeros per 4 elements along a reduction dimension.
Our approach can achieve a 15-25% reduction in total training time without a significant impact on model performance.
arXiv Detail & Related papers (2022-02-01T21:41:51Z) - HANT: Hardware-Aware Network Transformation [82.54824188745887]
We propose hardware-aware network transformation (HANT).
HANT replaces inefficient operations with more efficient alternatives using a neural architecture search-like approach.
Our results on accelerating the EfficientNet family show that HANT can accelerate them by up to 3.6x with a 0.4% drop in top-1 accuracy on the ImageNet dataset.
arXiv Detail & Related papers (2021-07-12T18:46:34Z) - Accelerating Sparse Deep Neural Networks [20.6942347219753]
We present the design and behavior of Sparse Tensor Cores, which exploit a 2:4 (50%) sparsity pattern that leads to twice the math throughput of dense matrix units.
We also describe a simple workflow for training networks that both satisfy the 2:4 sparsity pattern requirements and maintain accuracy (a simplified sketch of the compressed 2:4 layout appears after this list).
arXiv Detail & Related papers (2021-04-16T21:27:32Z) - Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch [75.69506249886622]
Sparsity in Deep Neural Networks (DNNs) has been widely studied to compress and accelerate models in resource-constrained environments.
In this paper, we are the first to study training from scratch an N:M fine-grained structured sparse network.
arXiv Detail & Related papers (2021-02-08T05:55:47Z) - FantastIC4: A Hardware-Software Co-Design Approach for Efficiently
Running 4bit-Compact Multilayer Perceptrons [19.411734658680967]
We propose a software-hardware optimization paradigm for obtaining a highly efficient execution engine for deep neural networks (DNNs).
Our approach is centred around compression as a means of reducing the area and power requirements of, concretely, multilayer perceptrons (MLPs) with high predictive performance.
We show that we can achieve a throughput of 2.45 TOPS with a total power consumption of 3.6W on a Virtex UltraScale FPGA XCVU440 device implementation, and a total power efficiency of 20.17 TOPS/W on a 22nm process ASIC version.
arXiv Detail & Related papers (2020-12-17T19:10:04Z) - When deep learning models on GPU can be accelerated by taking advantage
of unstructured sparsity [0.0]
This paper focuses on improving the efficiency of sparse convolutional neural network (CNN) layers on graphics processing units (GPUs).
Modern CNN models need megabytes of coefficients and millions of MAC operations to perform convolution.
We show when it is worth using a direct sparse operation to speed up the computation of the convolution layers.
arXiv Detail & Related papers (2020-11-12T10:13:48Z)
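As a companion to the "Accelerating Sparse Deep Neural Networks" entry above, the sketch below packs a 2:4-pruned matrix into the compressed form that 2:4 sparse hardware consumes conceptually: a half-size dense array of the kept values plus a 2-bit column index per kept value. This is a simplified host-side illustration under that assumption; the real hardware metadata layout is more involved, and compress_2_4 is a hypothetical helper, not part of Spatha or its V:N:M format.

```python
import numpy as np

def compress_2_4(w_sparse):
    """Pack a 2-D, 2:4-sparse matrix into (values, metadata):
    values holds the two nonzeros of every group of four (half-size dense),
    metadata holds their positions within the group (0..3, i.e. 2 bits each).

    Simplified illustration; not the actual Sparse Tensor Core layout.
    """
    w = np.asarray(w_sparse, dtype=np.float32)
    assert w.ndim == 2 and w.shape[1] % 4 == 0, "expect 2-D matrix, cols % 4 == 0"

    groups = w.reshape(-1, 4)
    nonzero = groups != 0
    assert np.all(nonzero.sum(axis=1) == 2), "input must be 2:4 sparse"

    # positions of the two kept values in each group, in ascending order
    idx = np.argsort(~nonzero, axis=1, kind="stable")[:, :2]
    vals = np.take_along_axis(groups, idx, axis=1)

    values = vals.reshape(w.shape[0], w.shape[1] // 2)
    metadata = idx.astype(np.uint8).reshape(w.shape[0], w.shape[1] // 2)
    return values, metadata

# Example: compress a tiny hand-written 2:4-sparse matrix.
w = np.array([[0.0, 1.5, 0.0, -2.0, 0.5, 0.0, 0.0, 3.0]], dtype=np.float32)
values, metadata = compress_2_4(w)
print(values)    # [[ 1.5 -2.   0.5  3. ]]
print(metadata)  # [[1 3 0 3]]
```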