Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch
- URL: http://arxiv.org/abs/2102.04010v1
- Date: Mon, 8 Feb 2021 05:55:47 GMT
- Title: Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch
- Authors: Aojun Zhou, Yukun Ma, Junnan Zhu, Jianbo Liu, Zhijie Zhang, Kun Yuan,
Wenxiu Sun, Hongsheng Li
- Abstract summary: Sparsity in Deep Neural Networks (DNNs) has been widely studied to compress and accelerate the models in resource-constrained environments.
In this paper, we are the first to study training an N:M fine-grained structured sparse network from scratch.
- Score: 75.69506249886622
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sparsity in Deep Neural Networks (DNNs) has been widely studied to compress
and accelerate the models in resource-constrained environments. It can be
generally categorized into unstructured fine-grained sparsity that zeroes out
multiple individual weights distributed across the neural network, and
structured coarse-grained sparsity which prunes blocks of sub-networks of a
neural network. Fine-grained sparsity can achieve a high compression ratio but
is not hardware friendly and hence receives limited speed gains. On the other
hand, coarse-grained sparsity cannot concurrently achieve both apparent
acceleration on modern GPUs and decent performance. In this paper, we are the
first to study training an N:M fine-grained structured sparse network from
scratch, which maintains the advantages of both unstructured fine-grained
sparsity and structured coarse-grained sparsity simultaneously on specially
designed GPUs. For example, a 2:4 sparse network can achieve a 2x speed-up
without performance drop on Nvidia A100 GPUs. Furthermore, we propose a novel
and effective ingredient, sparse-refined straight-through estimator (SR-STE),
to alleviate the negative influence of the approximated gradients computed by
vanilla STE during optimization. We also define a metric, Sparse Architecture
Divergence (SAD), to measure the sparse network's topology change during the
training process. Finally, we justify SR-STE's advantages with SAD and
demonstrate the effectiveness of SR-STE by performing comprehensive experiments
on various tasks. Source codes and models are available at
https://github.com/NM-sparsity/NM-sparsity.
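To make the N:M constraint and the SR-STE idea concrete, here is a minimal PyTorch-style sketch: a magnitude-based 2:4 mask over groups of four weights, a straight-through forward/backward pass for a linear layer, and an update step that additionally decays the currently pruned weights, which is the general intent of SR-STE. The function names (nm_mask, sr_ste_step), the linear-layer setting, and the hyperparameter values are illustrative assumptions rather than the authors' implementation; the official code lives in the repository linked above.

```python
import torch

def nm_mask(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Magnitude-based N:M mask: in each group of M consecutive weights
    along the input dimension, keep the N entries with the largest |w|."""
    out_features, in_features = weight.shape
    assert in_features % m == 0, "input dimension must be divisible by M"
    groups = weight.abs().reshape(-1, m)               # (num_groups, M)
    kept = torch.topk(groups, k=n, dim=1).indices      # positions to keep
    mask = torch.zeros_like(groups)
    mask.scatter_(1, kept, 1.0)
    return mask.reshape(out_features, in_features)

class SparseLinearSTE(torch.autograd.Function):
    """Forward uses the N:M-masked weights; backward passes the dense
    gradient straight through to the full weight tensor (vanilla STE)."""

    @staticmethod
    def forward(ctx, x, weight, mask):
        ctx.save_for_backward(x, weight, mask)
        return x @ (weight * mask).t()

    @staticmethod
    def backward(ctx, grad_out):
        x, weight, mask = ctx.saved_tensors
        grad_x = grad_out @ (weight * mask)
        grad_w = grad_out.t() @ x                      # dense (unmasked) gradient
        return grad_x, grad_w, None

def sr_ste_step(weight, grad_w, mask, lr=0.1, lam=2e-4):
    """One SGD step in the spirit of SR-STE: apply the STE gradient and,
    in addition, decay the currently pruned weights so the sparse topology
    changes less erratically; lam is a placeholder coefficient."""
    with torch.no_grad():
        weight -= lr * (grad_w + lam * (1.0 - mask) * weight)

# Usage sketch: recompute the mask from the dense weights every iteration,
# run the STE forward/backward, then apply the refined update.
w = torch.randn(8, 16, requires_grad=True)
x = torch.randn(4, 16)
mask = nm_mask(w, n=2, m=4)
loss = SparseLinearSTE.apply(x, w, mask).sum()
loss.backward()
sr_ste_step(w, w.grad, mask)
```

Under this reading, SAD would simply compare the masks produced by nm_mask at different training iterations to quantify how often the sparse topology flips.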
Related papers
- Kronecker-Factored Approximate Curvature for Modern Neural Network
Architectures [85.76673783330334]
Two different settings of linear weight-sharing layers motivate two flavours of Kronecker-Factored Approximate Curvature (K-FAC).
We show they are exact for deep linear networks with weight-sharing in their respective setting.
We observe little difference between these two K-FAC variations when using them to train both a graph neural network and a vision transformer.
arXiv Detail & Related papers (2023-11-01T16:37:00Z) - Speed Limits for Deep Learning [67.69149326107103]
Recent advancement in thermodynamics allows bounding the speed at which one can go from the initial weight distribution to the final distribution of the fully trained network.
We provide analytical expressions for these speed limits for linear and linearizable neural networks.
Remarkably, given some plausible scaling assumptions on the NTK spectra and the spectral decomposition of the labels, learning is optimal in a scaling sense.
arXiv Detail & Related papers (2023-07-27T06:59:46Z) - Dynamic Sparsity Is Channel-Level Sparsity Learner [91.31071026340746]
Dynamic sparse training (DST) is a leading sparse training approach.
Channel-aware dynamic sparse (Chase) seamlessly translates the promise of unstructured dynamic sparsity to channel-level sparsity.
arXiv Detail & Related papers (2023-05-30T23:33:45Z) - Dynamic Sparse Training with Structured Sparsity [11.778353786208765]
Dynamic Sparse Training (DST) methods achieve state-of-the-art results in sparse neural network training.
We propose a sparse-to-sparse DST method, Structured RigL (SRigL), to learn a variant of fine-grained structured N:M sparsity.
We demonstrate a real-world acceleration of 3.4x/2.5x on CPU for online inference and 1.7x/13.0x on GPU for inference with a batch size of 256.
arXiv Detail & Related papers (2023-05-03T17:48:55Z) - Iterative Soft Shrinkage Learning for Efficient Image Super-Resolution [91.3781512926942]
Image super-resolution (SR) has witnessed extensive neural network designs from CNN to transformer architectures.
This work investigates the potential of network pruning for super-resolution to take advantage of off-the-shelf network designs and reduce the underlying computational overhead.
We propose a novel Iterative Soft Shrinkage-Percentage (ISS-P) method that optimizes the sparse structure of a randomly initialized network at each iteration and tweaks unimportant weights by a small amount proportional to the magnitude scale on-the-fly.
arXiv Detail & Related papers (2023-03-16T21:06:13Z) - FSCNN: A Fast Sparse Convolution Neural Network Inference System [31.474696818171953]
Convolutional neural networks (CNNs) have achieved remarkable success, but typically come with high computation cost and numerous redundant weight parameters.
To reduce FLOPs, structured pruning is a popular approach that removes entire hidden structures by introducing coarse-grained sparsity.
We present an efficient convolutional neural network inference system that accelerates the forward pass by exploiting the fine-grained sparsity of compressed CNNs.
arXiv Detail & Related papers (2022-12-17T06:44:58Z) - Speedup deep learning models on GPU by taking advantage of efficient
unstructured pruning and bit-width reduction [0.0]
This work focuses on pruning several convolutional neural networks (CNNs) and improving their efficiency on graphics processing units (GPUs).
The Nvidia deep neural network library (cuDNN) is the most effective implementation of deep learning (DL) algorithms for GPUs.
arXiv Detail & Related papers (2021-12-28T19:36:41Z) - HALO: Learning to Prune Neural Networks with Shrinkage [5.283963846188862]
Deep neural networks achieve state-of-the-art performance in a variety of tasks by extracting a rich set of features from unstructured data.
Modern techniques for inducing sparsity and reducing model size are (1) network pruning, (2) training with a sparsity-inducing penalty, and (3) training a binary mask jointly with the weights of the network.
We present a novel penalty called Hierarchical Adaptive Lasso (HALO), which learns to adaptively sparsify the weights of a given network via trainable parameters.
arXiv Detail & Related papers (2020-08-24T04:08:48Z) - Finite Versus Infinite Neural Networks: an Empirical Study [69.07049353209463]
Kernel methods outperform fully-connected finite-width networks.
Centered and ensembled finite networks have reduced posterior variance.
Weight decay and the use of a large learning rate break the correspondence between finite and infinite networks.
arXiv Detail & Related papers (2020-07-31T01:57:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.