Accelerating Sparse DNN Models without Hardware-Support via Tile-Wise
Sparsity
- URL: http://arxiv.org/abs/2008.13006v1
- Date: Sat, 29 Aug 2020 16:27:41 GMT
- Title: Accelerating Sparse DNN Models without Hardware-Support via Tile-Wise
Sparsity
- Authors: Cong Guo and Bo Yang Hsueh and Jingwen Leng and Yuxian Qiu and Yue
Guan and Zehuan Wang and Xiaoying Jia and Xipeng Li and Minyi Guo and Yuhao
Zhu
- Abstract summary: We propose an algorithm-software co-designed pruning method that achieves latency speedups on existing dense architectures.
We implement and evaluate the sparsity pattern on GPU tensor core, achieving a 1.95x speedup over the dense model.
- Score: 12.643043455369297
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Network pruning can reduce the high computation cost of deep neural network
(DNN) models. However, to maintain their accuracies, sparse models often carry
randomly-distributed weights, leading to irregular computations. Consequently,
sparse models cannot achieve meaningful speedup on commodity hardware (e.g.,
GPU) built for dense matrix computations. As such, prior works usually modify
or design completely new sparsity-optimized architectures for exploiting
sparsity. We propose an algorithm-software co-designed pruning method that
achieves latency speedups on existing dense architectures. Our work builds upon
the insight that the matrix multiplication generally breaks the large matrix
into multiple smaller tiles for parallel execution. We propose a
tiling-friendly "tile-wise" sparsity pattern, which maintains a regular pattern
at the tile level for efficient execution but allows for irregular, arbitrary
pruning at the global scale to maintain the high accuracy. We implement and
evaluate the sparsity pattern on GPU tensor core, achieving a 1.95x speedup
over the dense model.
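To make the tile-wise idea concrete, here is a minimal NumPy sketch of the pruning pattern; the function name, tile size, and the L2-norm importance criterion are illustrative assumptions, not the paper's implementation, which co-designs the pattern with a GPU tensor-core GEMM kernel. Each tile keeps a regular, dense-friendly layout (whole columns removed), while different tiles drop different columns, so the global pattern remains irregular.

```python
import numpy as np

def tile_wise_prune(weight, tile_cols=64, sparsity=0.5):
    # Hypothetical illustration (not the authors' code): split the weight
    # matrix into column tiles and, inside each tile, zero out the columns
    # with the smallest L2 norm. Every tile stays regular (a fixed fraction
    # of whole columns removed), but the set of pruned columns differs from
    # tile to tile, so the global pattern is irregular.
    pruned = weight.copy()
    _, n_cols = weight.shape
    for start in range(0, n_cols, tile_cols):
        tile = pruned[:, start:start + tile_cols]      # view into `pruned`
        col_norms = np.linalg.norm(tile, axis=0)       # column importance proxy
        n_drop = int(tile.shape[1] * sparsity)
        drop = np.argsort(col_norms)[:n_drop]          # least-important columns
        tile[:, drop] = 0.0                            # prune whole columns in this tile
    return pruned

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((128, 256)).astype(np.float32)
    W_tw = tile_wise_prune(W)
    print("overall sparsity:", float((W_tw == 0.0).mean()))   # ~0.5
```

In an actual kernel, the zeroed columns would be compacted away so each tile becomes a smaller dense GEMM, which is how the pattern can yield speedups on hardware built for dense matrix computations.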
Related papers
- Compute Better Spent: Replacing Dense Layers with Structured Matrices [77.61728033234233]
We identify more efficient alternatives to dense matrices, as exemplified by the success of convolutional networks in the image domain.
We show that different structures often require drastically different initialization scales and learning rates, which are crucial to performance.
We propose a novel matrix family containing Monarch matrices, the Block-Train, which we show performs better than dense layers for the same compute on multiple tasks.
arXiv Detail & Related papers (2024-06-10T13:25:43Z) - Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z) - PopSparse: Accelerated block sparse matrix multiplication on IPU [0.5661403709207713]
We introduce PopSparse, a library that enables fast sparse operations on Graphcore IPUs.
We target two different types of sparsity: static, where the sparsity pattern is fixed at compile-time; and dynamic, where it can change each time the model is run.
Results indicate that the PopSparse implementations are faster than dense matrix multiplications on IPU at a range of sparsity levels.
arXiv Detail & Related papers (2023-03-29T20:00:19Z) - PIT: Optimization of Dynamic Sparse Deep Learning Models via Permutation
Invariant Transformation [15.860204740425791]
We propose Permutation Invariant Transformation (PIT) for dynamic sparsity computation.
PIT transforms micro-tiles into a GPU-efficient dense tile without changing the results.
It can accelerate dynamic sparsity computation by up to 5.9x (average 2.43x) over state-of-the-art compilers.
arXiv Detail & Related papers (2023-01-26T04:50:14Z) - RSC: Accelerating Graph Neural Networks Training via Randomized Sparse
Computations [56.59168541623729]
Training graph neural networks (GNNs) is time consuming because sparse graph-based operations are hard to accelerate in hardware.
We explore trading off computational precision to reduce the time complexity via sampling-based approximation.
We propose Randomized Sparse Computation, which for the first time demonstrates the potential of training GNNs with approximated operations.
arXiv Detail & Related papers (2022-10-19T17:25:33Z) - Instant Neural Graphics Primitives with a Multiresolution Hash Encoding [67.33850633281803]
We present a versatile new input encoding that permits the use of a smaller network without sacrificing quality.
A small neural network is augmented by a multiresolution hash table of trainable feature vectors whose values are optimized through gradient descent.
We achieve a combined speedup of several orders of magnitude, enabling training of high-quality neural graphics primitives in a matter of seconds.
arXiv Detail & Related papers (2022-01-16T07:22:47Z) - Effective Model Sparsification by Scheduled Grow-and-Prune Methods [73.03533268740605]
We propose a novel scheduled grow-and-prune (GaP) methodology without pre-training the dense models.
Experiments have shown that such models can match or beat the quality of highly optimized dense models at 80% sparsity on a variety of tasks.
arXiv Detail & Related papers (2021-06-18T01:03:13Z) - VersaGNN: a Versatile accelerator for Graph neural networks [81.1667080640009]
We propose VersaGNN, an ultra-efficient, systolic-array-based versatile hardware accelerator.
VersaGNN achieves on average a 3712x speedup with 1301.25x energy reduction on CPU, and a 35.4x speedup with 17.66x energy reduction on GPU.
arXiv Detail & Related papers (2021-05-04T04:10:48Z) - Accelerating Sparse Deep Neural Networks [20.6942347219753]
We present the design and behavior of Sparse Tensor Cores, which exploit a 2:4 (50%) sparsity pattern that leads to twice the math throughput of dense matrix units.
We also describe a simple workflow for training networks that both satisfy the 2:4 sparsity pattern requirements and maintain accuracy; a minimal sketch of the 2:4 pattern itself follows this list.
arXiv Detail & Related papers (2021-04-16T21:27:32Z) - When deep learning models on GPU can be accelerated by taking advantage
of unstructured sparsity [0.0]
This paper focuses on improving the efficiency of sparse convolutional neural network (CNN) layers on graphics processing units (GPUs).
Modern CNN models need megabytes of coefficients and millions of MAC operations to perform convolution.
We show when it is worth using direct sparse operations to speed up the computation of the convolution layers.
arXiv Detail & Related papers (2020-11-12T10:13:48Z) - Procrustes: a Dataflow and Accelerator for Sparse Deep Neural Network
Training [0.5219568203653523]
We develop a sparse DNN training accelerator that produces pruned models with the same accuracy as dense models, without first training, then pruning, and finally retraining a dense model.
Compared to training the equivalent unpruned models using a state-of-the-art DNN accelerator without sparse training support, Procrustes consumes up to 3.26x less energy and offers up to 4x speedup across a range of models, while pruning weights by an order of magnitude and maintaining unpruned accuracy.
arXiv Detail & Related papers (2020-09-23T07:39:55Z)
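As a companion to the 2:4 entry above, here is a rough NumPy sketch of what the 2:4 constraint means: in every group of four consecutive weights along a row, two are kept and two are zeroed. The magnitude-based selection and the helper name are assumptions for illustration; this is not NVIDIA's training workflow or the Sparse Tensor Core hardware path.

```python
import numpy as np

def prune_2_4(weight):
    # Hypothetical sketch of 2:4 structured sparsity: within each group of
    # four consecutive weights along a row, keep the two largest-magnitude
    # values and zero the other two (exactly two zeros per group of four).
    rows, cols = weight.shape
    assert cols % 4 == 0, "columns must be a multiple of 4 for this sketch"
    groups = weight.reshape(rows, cols // 4, 4)
    drop = np.argsort(np.abs(groups), axis=-1)[..., :2]   # two smallest per group
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=-1)
    return (groups * mask).reshape(rows, cols)

if __name__ == "__main__":
    W = np.random.default_rng(1).standard_normal((8, 16)).astype(np.float32)
    W24 = prune_2_4(W)
    print("overall sparsity:", float((W24 == 0.0).mean()))   # 0.5
```

Hardware support for this pattern typically stores only the two surviving values per group plus small indices, skipping the zeros at compute time.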