Effective Model Sparsification by Scheduled Grow-and-Prune Methods
- URL: http://arxiv.org/abs/2106.09857v1
- Date: Fri, 18 Jun 2021 01:03:13 GMT
- Title: Effective Model Sparsification by Scheduled Grow-and-Prune Methods
- Authors: Xiaolong Ma, Minghai Qin, Fei Sun, Zejiang Hou, Kun Yuan, Yi Xu,
Yanzhi Wang, Yen-Kuang Chen, Rong Jin, Yuan Xie
- Abstract summary: We propose a novel scheduled grow-and-prune (GaP) methodology without pre-training the dense models.
Experiments have shown that such models can match or beat the quality of highly optimized dense models at 80% sparsity on a variety of tasks.
- Score: 73.03533268740605
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep neural networks (DNNs) are effective in solving many real-world
problems. Larger DNN models usually exhibit better quality (e.g., accuracy) but
their excessive computation results in long training and inference time. Model
sparsification can reduce the computation and memory cost while maintaining
model quality. Most existing sparsification algorithms unidirectionally remove
weights, while others randomly or greedily explore a small subset of weights in
each layer. The inefficiency of the algorithms reduces the achievable sparsity
level. In addition, many algorithms still require pre-trained dense models and
thus suffer from large memory footprint and long training time. In this paper,
we propose a novel scheduled grow-and-prune (GaP) methodology without
pre-training the dense models. It addresses the shortcomings of the previous
works by repeatedly growing a subset of layers to dense and then pruning back
to sparse after some training. Experiments have shown that such models can
match or beat the quality of highly optimized dense models at 80% sparsity on a
variety of tasks, such as image classification, objective detection, 3D object
part segmentation, and translation. They also outperform other state-of-the-art
(SOTA) pruning methods, including pruning from pre-trained dense models. As an
example, a 90% sparse ResNet-50 obtained via GaP achieves 77.9% top-1 accuracy
on ImageNet, improving the SOTA results by 1.5%.
Related papers
- MARS: Unleashing the Power of Variance Reduction for Training Large Models [56.47014540413659]
Large gradient algorithms like Adam, Adam, and their variants have been central to the development of this type of training.
We propose a framework that reconciles preconditioned gradient optimization methods with variance reduction via a scaled momentum technique.
arXiv Detail & Related papers (2024-11-15T18:57:39Z) - Calibrating the Rigged Lottery: Making All Tickets Reliable [14.353428281239665]
We propose a new sparse training method to produce sparse models with improved confidence calibration.
Our method simultaneously maintains or even improves accuracy with only a slight increase in computation and storage burden.
arXiv Detail & Related papers (2023-02-18T15:53:55Z) - Dynamic Sparse Training via Balancing the Exploration-Exploitation
Trade-off [19.230329532065635]
Sparse training could significantly mitigate the training costs by reducing the model size.
Existing sparse training methods mainly use either random-based or greedy-based drop-and-grow strategies.
In this work, we consider the dynamic sparse training as a sparse connectivity search problem.
Experimental results show that sparse models (up to 98% sparsity) obtained by our proposed method outperform the SOTA sparse training methods.
arXiv Detail & Related papers (2022-11-30T01:22:25Z) - Towards Sparsification of Graph Neural Networks [9.568566305616656]
We use two state-of-the-art model compression methods to train and prune and sparse training for the sparsification of weight layers in GNNs.
We evaluate and compare the efficiency of both methods in terms of accuracy, training sparsity, and training FLOPs on real-world graphs.
arXiv Detail & Related papers (2022-09-11T01:39:29Z) - Two Heads are Better than One: Robust Learning Meets Multi-branch Models [14.72099568017039]
We propose Branch Orthogonality adveRsarial Training (BORT) to obtain state-of-the-art performance with solely the original dataset for adversarial training.
We evaluate our approach on CIFAR-10, CIFAR-100, and SVHN against ell_infty norm-bounded perturbations of size epsilon = 8/255, respectively.
arXiv Detail & Related papers (2022-08-17T05:42:59Z) - Powerpropagation: A sparsity inducing weight reparameterisation [65.85142037667065]
We introduce Powerpropagation, a new weight- parameterisation for neural networks that leads to inherently sparse models.
Models trained in this manner exhibit similar performance, but have a distribution with markedly higher density at zero, allowing more parameters to be pruned safely.
Here, we combine Powerpropagation with a traditional weight-pruning technique as well as recent state-of-the-art sparse-to-sparse algorithms, showing superior performance on the ImageNet benchmark.
arXiv Detail & Related papers (2021-10-01T10:03:57Z) - AC/DC: Alternating Compressed/DeCompressed Training of Deep Neural
Networks [78.62086125399831]
We present a general approach called Alternating Compressed/DeCompressed (AC/DC) training of deep neural networks (DNNs)
AC/DC outperforms existing sparse training methods in accuracy at similar computational budgets.
An important property of AC/DC is that it allows co-training of dense and sparse models, yielding accurate sparse-dense model pairs at the end of the training process.
arXiv Detail & Related papers (2021-06-23T13:23:00Z) - Dynamic Model Pruning with Feedback [64.019079257231]
We propose a novel model compression method that generates a sparse trained model without additional overhead.
We evaluate our method on CIFAR-10 and ImageNet, and show that the obtained sparse models can reach the state-of-the-art performance of dense models.
arXiv Detail & Related papers (2020-06-12T15:07:08Z) - Learning Low-rank Deep Neural Networks via Singular Vector Orthogonality
Regularization and Singular Value Sparsification [53.50708351813565]
We propose SVD training, the first method to explicitly achieve low-rank DNNs during training without applying SVD on every step.
We empirically show that SVD training can significantly reduce the rank of DNN layers and achieve higher reduction on computation load under the same accuracy.
arXiv Detail & Related papers (2020-04-20T02:40:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.