Powerpropagation: A sparsity inducing weight reparameterisation
- URL: http://arxiv.org/abs/2110.00296v1
- Date: Fri, 1 Oct 2021 10:03:57 GMT
- Title: Powerpropagation: A sparsity inducing weight reparameterisation
- Authors: Jonathan Schwarz and Siddhant M. Jayakumar and Razvan Pascanu and
Peter E. Latham and Yee Whye Teh
- Abstract summary: We introduce Powerpropagation, a new weight-parameterisation for neural networks that leads to inherently sparse models.
Models trained in this manner exhibit similar performance, but have a distribution with markedly higher density at zero, allowing more parameters to be pruned safely.
Here, we combine Powerpropagation with a traditional weight-pruning technique as well as recent state-of-the-art sparse-to-sparse algorithms, showing superior performance on the ImageNet benchmark.
- Score: 65.85142037667065
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The training of sparse neural networks is becoming an increasingly important
tool for reducing the computational footprint of models at training and
evaluation, as well as enabling the effective scaling up of models. Whereas much
work over the years has been dedicated to specialised pruning techniques,
little attention has been paid to the inherent effect of gradient based
training on model sparsity. In this work, we introduce Powerpropagation, a new
weight-parameterisation for neural networks that leads to inherently sparse
models. Exploiting the behaviour of gradient descent, our method gives rise to
weight updates exhibiting a "rich get richer" dynamic, leaving low-magnitude
parameters largely unaffected by learning. Models trained in this manner
exhibit similar performance, but have a distribution with markedly higher
density at zero, allowing more parameters to be pruned safely. Powerpropagation
is general, intuitive, cheap and straight-forward to implement and can readily
be combined with various other techniques. To highlight its versatility, we
explore it in two very different settings: Firstly, following a recent line of
work, we investigate its effect on sparse training for resource-constrained
settings. Here, we combine Powerpropagation with a traditional weight-pruning
technique as well as recent state-of-the-art sparse-to-sparse algorithms,
showing superior performance on the ImageNet benchmark. Secondly, we advocate
the use of sparsity in overcoming catastrophic forgetting, where compressed
representations allow accommodating a large number of tasks at fixed model
capacity. In all cases our reparameterisation considerably increases the
efficacy of the off-the-shelf methods.
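As a concrete illustration, the sketch below reparameterises a linear layer's weights as w = theta * |theta|^(alpha - 1), so the gradient with respect to theta is scaled by alpha * |theta|^(alpha - 1) and low-magnitude parameters receive correspondingly small updates (the "rich get richer" dynamic described above). This is a minimal PyTorch sketch, not the authors' reference implementation; the layer name, the choice alpha = 2, and the initialisation scheme are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PowerpropLinear(nn.Module):
    """Linear layer with weights reparameterised as w = theta * |theta|**(alpha - 1).

    With alpha = 1 this reduces to an ordinary linear layer; with alpha > 1 the
    gradient w.r.t. theta is scaled by alpha * |theta|**(alpha - 1), so
    low-magnitude parameters are largely left alone by learning.
    """

    def __init__(self, in_features: int, out_features: int, alpha: float = 2.0):
        super().__init__()
        self.alpha = alpha
        self.theta = nn.Parameter(torch.empty(out_features, in_features))
        self.bias = nn.Parameter(torch.zeros(out_features))
        # Initialise so the *effective* weights w match a standard init, then
        # invert the reparameterisation: theta = sign(w) * |w|**(1 / alpha).
        w = torch.empty(out_features, in_features)
        nn.init.kaiming_uniform_(w, a=5 ** 0.5)
        with torch.no_grad():
            self.theta.copy_(w.sign() * w.abs().pow(1.0 / alpha))

    def effective_weight(self) -> torch.Tensor:
        # w = theta * |theta|**(alpha - 1) == sign(theta) * |theta|**alpha
        return self.theta * self.theta.abs().pow(self.alpha - 1.0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, self.effective_weight(), self.bias)
```

After training, the effective weights can be pruned by magnitude exactly as with a standard layer; the reparameterisation only changes how gradient descent distributes updates across parameters.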
Related papers
- SaRA: High-Efficient Diffusion Model Fine-tuning with Progressive Sparse Low-Rank Adaptation [52.6922833948127]
In this work, we investigate the importance of parameters in pre-trained diffusion models.
We propose a novel model fine-tuning method to make full use of these ineffective parameters.
Our method enhances the generative capabilities of pre-trained models in downstream applications.
arXiv Detail & Related papers (2024-09-10T16:44:47Z)
- HyperSparse Neural Networks: Shifting Exploration to Exploitation through Adaptive Regularization [18.786142528591355]
Sparse neural networks are a key factor in developing resource-efficient machine learning applications.
We propose the novel and powerful sparse learning method Adaptive Regularized Training (ART) to compress dense into sparse networks.
Our method compresses the pre-trained model knowledge into the weights of highest magnitude.
arXiv Detail & Related papers (2023-08-14T14:18:11Z)
- FOSTER: Feature Boosting and Compression for Class-Incremental Learning [52.603520403933985]
Deep neural networks suffer from catastrophic forgetting when learning new categories.
We propose a novel two-stage learning paradigm FOSTER, empowering the model to learn new categories adaptively.
arXiv Detail & Related papers (2022-04-10T11:38:33Z)
- Dynamic Collective Intelligence Learning: Finding Efficient Sparse Model via Refined Gradients for Pruned Weights [31.68257673664519]
Dynamic pruning methods try to find diverse sparsity patterns during training by utilizing the Straight-Through Estimator (STE) to approximate gradients of pruned weights (a generic sketch of this masking idea appears after this list).
We introduce refined gradients to update the pruned weights by forming dual forwarding paths from two sets (pruned and unpruned) of weights.
We propose a novel Dynamic Collective Intelligence Learning (DCIL) which makes use of the learning synergy between the collective intelligence of both weight sets.
arXiv Detail & Related papers (2021-09-10T04:41:17Z)
- Top-KAST: Top-K Always Sparse Training [50.05611544535801]
We propose Top-KAST, a method that preserves constant sparsity throughout training.
We show that it performs comparably to or better than previous works when training models on the established ImageNet benchmark.
In addition to our ImageNet results, we also demonstrate our approach in the domain of language modeling.
arXiv Detail & Related papers (2021-06-07T11:13:05Z)
- Provable Benefits of Overparameterization in Model Compression: From Double Descent to Pruning Neural Networks [38.153825455980645]
Recent empirical evidence indicates that the practice of overparameterization not only benefits the training of large models, but also assists, perhaps counterintuitively, in building lightweight models.
This paper sheds light on these empirical findings by theoretically characterizing the high-dimensional asymptotics of model pruning.
We analytically identify regimes in which, even if the location of the most informative features is known, we are better off fitting a large model and then pruning.
arXiv Detail & Related papers (2020-12-16T05:13:30Z)
- Towards Practical Lipreading with Distilled and Efficient Models [57.41253104365274]
Lipreading has witnessed a lot of progress due to the resurgence of neural networks.
Recent works have placed emphasis on aspects such as improving performance by finding the optimal architecture or improving generalization.
There is still a significant gap between the current methodologies and the requirements for an effective deployment of lipreading in practical scenarios.
We propose a series of innovations that significantly bridge that gap: first, we raise the state-of-the-art performance by a wide margin on LRW and LRW-1000, to 88.5% and 46.6% respectively, using self-distillation.
arXiv Detail & Related papers (2020-07-13T16:56:27Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
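Several of the entries above (e.g., the DCIL and Top-KAST summaries) describe training with a magnitude-based mask while still letting gradient signal reach the pruned weights. The sketch below is a generic PyTorch illustration of that idea via a straight-through estimator; it is not the algorithm of any specific paper listed here, and the layer name, density parameter, and initialisation are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedLinearSTE(nn.Module):
    """Linear layer whose forward pass uses only the top-k weights by magnitude,
    while a straight-through estimator lets gradients reach pruned weights so the
    sparsity pattern can change during training."""

    def __init__(self, in_features: int, out_features: int, density: float = 0.1):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.density = density  # fraction of weights kept active in the forward pass

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n = self.weight.numel()
        k = max(1, int(self.density * n))
        # Threshold at the k-th largest magnitude; keep everything at or above it.
        threshold = self.weight.abs().flatten().kthvalue(n - k + 1).values
        mask = (self.weight.abs() >= threshold).float()
        # Straight-through estimator: the forward value equals weight * mask, but
        # the gradient w.r.t. weight is the identity, so pruned weights keep
        # receiving updates and can re-enter the active set later.
        w_masked = self.weight - (self.weight - self.weight * mask).detach()
        return F.linear(x, w_masked, self.bias)
```

This is only the common STE baseline: Top-KAST additionally restricts the backward pass to a larger top-k set rather than updating all weights, and DCIL refines the pruned-weight gradients with a second forwarding path, as summarised in the entries above.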