The Emergence of Essential Sparsity in Large Pre-trained Models: The
Weights that Matter
- URL: http://arxiv.org/abs/2306.03805v2
- Date: Wed, 9 Aug 2023 21:15:46 GMT
- Title: The Emergence of Essential Sparsity in Large Pre-trained Models: The
Weights that Matter
- Authors: Ajay Jaiswal, Shiwei Liu, Tianlong Chen, Zhangyang Wang
- Abstract summary: This paper studies induced sparse patterns across multiple large pre-trained vision and language transformers.
We propose the existence of essential sparsity, defined by a sharp dropping point beyond which performance declines much faster.
We also find that essential sparsity holds for N:M sparsity patterns as well as for modern-scale large language models.
- Score: 113.35761858962522
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large pre-trained transformers are the show-stealers of modern-day deep learning, and it becomes crucial to comprehend the parsimonious patterns that exist within them as they grow in scale. With exploding parameter counts, the Lottery Ticket Hypothesis (LTH) and its variants have lost their pragmatism for sparsifying them, due to the high computation and memory bottleneck of the repetitive train-prune-retrain routine of iterative magnitude pruning (IMP), which worsens with increasing model size. This paper comprehensively studies induced sparse patterns across multiple large pre-trained vision and language transformers. We propose the existence of essential sparsity, defined by a sharp dropping point beyond which performance declines much faster with the rise of the sparsity level, when we directly remove the weights with the smallest magnitudes in one shot, without re-training. We also find that essential sparsity holds for N:M sparsity patterns as well as for modern-scale large language models (Vicuna-7B). We further present an intriguing emerging phenomenon of abrupt sparsification during the pre-training of BERT, i.e., BERT suddenly becomes heavily sparse after a certain number of pre-training iterations. Moreover, our observations indicate the counter-intuitive finding that BERT trained with a larger amount of pre-training data tends to condense knowledge into comparatively fewer parameters. Lastly, we investigate the effect of the pre-training loss on essential sparsity and discover that self-supervised learning (SSL) objectives trigger stronger emergent sparsification properties than supervised learning (SL). Our code is available at \url{https://github.com/VITA-Group/essential_sparsity}.
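To make the protocol concrete, the following is a minimal sketch (not the authors' released code; see the repository linked above for the official implementation) of one-shot global magnitude pruning followed by a sparsity sweep, which is how a sharp dropping point such as essential sparsity can be located empirically. The `evaluate` callable, the restriction to nn.Linear layers, and the sparsity grid are illustrative assumptions.

```python
import copy
import torch
import torch.nn as nn

@torch.no_grad()
def one_shot_magnitude_prune(model: nn.Module, sparsity: float) -> nn.Module:
    """Zero out the globally smallest-magnitude weights of all Linear layers.

    `sparsity` is the fraction of prunable weights removed (e.g. 0.5 = 50%).
    No re-training is performed afterwards (one-shot pruning).
    """
    pruned = copy.deepcopy(model)
    weights = [m.weight for m in pruned.modules() if isinstance(m, nn.Linear)]

    # Global threshold: magnitudes at or below it are dropped.
    all_mags = torch.cat([w.abs().flatten() for w in weights])
    k = int(sparsity * all_mags.numel())
    if k == 0:
        return pruned
    threshold = torch.kthvalue(all_mags, k).values

    for w in weights:
        w.mul_((w.abs() > threshold).to(w.dtype))  # apply binary mask in place
    return pruned


def sparsity_sweep(model, evaluate, levels=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7)):
    """Evaluate one-shot pruned copies at several sparsity levels.

    `evaluate` is a user-supplied callable returning a task metric; the level
    at which the metric starts to drop sharply approximates the essential
    sparsity point described in the abstract.
    """
    return {s: evaluate(one_shot_magnitude_prune(model, s)) for s in levels}
```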
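For the N:M results, the same magnitude criterion is applied within fixed-size groups rather than globally. The sketch below builds a 2:4-style mask (keep the N largest-magnitude weights in every group of M along the input dimension); the group layout and the defaults are assumptions for illustration, not the paper's exact procedure.

```python
import torch

def n_m_mask(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Binary mask keeping the n largest-magnitude weights in every group of m.

    `weight` is a 2-D matrix whose last dimension is a multiple of m
    (e.g. an nn.Linear weight); returns a {0, 1} mask of the same shape.
    """
    rows, cols = weight.shape
    groups = weight.abs().reshape(rows, cols // m, m)
    # Indices of the n largest magnitudes within each group of m.
    topk = groups.topk(n, dim=-1).indices
    mask = torch.zeros_like(groups)
    mask.scatter_(-1, topk, 1.0)
    return mask.reshape(rows, cols)

# Usage sketch: weight.mul_(n_m_mask(weight)) applies 2:4 sparsity in place.
```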
Related papers
- An Emulator for Fine-Tuning Large Language Models using Small Language
Models [91.02498576056057]
We introduce emulated fine-tuning (EFT), a principled and practical method for sampling from a distribution that approximates the result of pre-training and fine-tuning at different scales.
We show that EFT enables test-time adjustment of competing behavioral traits like helpfulness and harmlessness without additional training.
Finally, a special case of emulated fine-tuning, which we call LM up-scaling, avoids resource-intensive fine-tuning of large pre-trained models by ensembling them with small fine-tuned models.
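As a hedged illustration of the up-scaling idea, the sketch below combines next-token log-probabilities: the large base model's distribution is shifted by the behavioural delta between a small fine-tuned model and its small base counterpart, then renormalised. The function name and tensor interface are assumptions, not the paper's released implementation.

```python
import torch

def eft_logprobs(large_base_lp, small_ft_lp, small_base_lp):
    """Emulated fine-tuning at the level of next-token log-probabilities.

    Each argument is a tensor of log-probabilities over the vocabulary,
    computed by the corresponding model on the same context. The large base
    distribution is shifted by the fine-tuning delta of the small model pair
    and renormalised; this sketches the up-scaling special case.
    """
    combined = large_base_lp + (small_ft_lp - small_base_lp)
    return torch.log_softmax(combined, dim=-1)
```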
arXiv Detail & Related papers (2023-10-19T17:57:16Z)
- Small-scale proxies for large-scale Transformer training instabilities [69.36381318171338]
We seek ways to reproduce and study training stability and instability at smaller scales.
By measuring the relationship between learning rate and loss across scales, we show that these instabilities also appear in small models when training at high learning rates.
We study methods such as warm-up, weight decay, and the $\mu$Param parameterization to train small models that achieve similar losses across orders of magnitude of learning rate variation.
arXiv Detail & Related papers (2023-09-25T17:48:51Z)
- HyperSparse Neural Networks: Shifting Exploration to Exploitation through Adaptive Regularization [18.786142528591355]
Sparse neural networks are a key factor in developing resource-efficient machine learning applications.
We propose the novel and powerful sparse learning method Adaptive Regularized Training (ART) to compress dense networks into sparse networks.
Our method compresses the pre-trained model knowledge into the weights of highest magnitude.
arXiv Detail & Related papers (2023-08-14T14:18:11Z)
- SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models [4.114555639014612]
We show the benefits of using unstructured weight sparsity to train only a subset of weights during pre-training.
We demonstrate that we can induce up to 75% sparsity into a 1.3B parameter GPT-3 XL model resulting in a 2.5x reduction in pre-training FLOPs.
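A minimal sketch of such a sparse-pre-train / dense-fine-tune schedule is given below, assuming a static random unstructured mask that is re-applied after each optimizer step during pre-training and simply dropped for fine-tuning; the mask-selection strategy and sparsity level are illustrative assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn as nn

def make_random_masks(model: nn.Module, sparsity: float = 0.75):
    """One static random binary mask per Linear weight (75% zeros by default)."""
    return {
        name: (torch.rand_like(m.weight) > sparsity).float()
        for name, m in model.named_modules()
        if isinstance(m, nn.Linear)
    }

@torch.no_grad()
def apply_masks(model: nn.Module, masks):
    """Re-impose the masks after every optimizer step during sparse pre-training."""
    for name, m in model.named_modules():
        if name in masks:
            m.weight.mul_(masks[name])

# Sparse pre-training (sketch): step the optimizer, then call apply_masks(model, masks).
# For dense fine-tuning, stop calling apply_masks and train all weights.
```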
arXiv Detail & Related papers (2023-03-18T17:56:01Z)
- PLATON: Pruning Large Transformer Models with Upper Confidence Bound of Weight Importance [114.1541203743303]
We propose PLATON, which captures the uncertainty of importance scores by upper confidence bound (UCB) of importance estimation.
We conduct extensive experiments with several Transformer-based models on natural language understanding, question answering and image classification.
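As a generic illustration of UCB-style scoring (not necessarily PLATON's exact formulation), the sketch below tracks an exponential moving average of a first-order sensitivity signal together with its deviation, and adds the uncertainty as a bonus so that weights with noisy importance estimates are retained more conservatively. The sensitivity definition and the mean-plus-uncertainty combination are assumptions for illustration.

```python
import torch

class UCBImportance:
    """Generic UCB-style importance tracker for pruning decisions (sketch).

    Keeps exponential moving averages of a per-weight sensitivity signal and
    of its deviation; the score adds an uncertainty bonus so that weights
    with noisy importance estimates are pruned more conservatively.
    """

    def __init__(self, beta: float = 0.85, bonus: float = 1.0):
        self.beta, self.bonus = beta, bonus
        self.mean, self.uncertainty = None, None

    def update(self, weight: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
        sensitivity = (weight * grad).abs()  # first-order importance proxy
        if self.mean is None:
            self.mean = sensitivity.clone()
            self.uncertainty = torch.zeros_like(sensitivity)
        deviation = (sensitivity - self.mean).abs()
        self.mean = self.beta * self.mean + (1 - self.beta) * sensitivity
        self.uncertainty = self.beta * self.uncertainty + (1 - self.beta) * deviation
        return self.mean + self.bonus * self.uncertainty  # UCB-style score
```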
arXiv Detail & Related papers (2022-06-25T05:38:39Z)
- Parameter-Efficient Sparsity for Large Language Models Fine-Tuning [63.321205487234074]
We propose a Parameter-efficient Sparse Training (PST) method to reduce the number of trainable parameters during sparse-aware training.
Experiments with diverse networks (i.e., BERT, RoBERTa and GPT-2) demonstrate PST performs on par or better than previous sparsity methods.
arXiv Detail & Related papers (2022-05-23T02:43:45Z)
- An Empirical Investigation of the Role of Pre-training in Lifelong Learning [21.995593026269578]
We show that generic pre-training implicitly alleviates the effects of catastrophic forgetting when learning multiple tasks sequentially.
We study this phenomenon by analyzing the loss landscape, finding that pre-trained weights appear to ease forgetting by leading to wider minima.
arXiv Detail & Related papers (2021-12-16T19:00:55Z)
- Powerpropagation: A sparsity inducing weight reparameterisation [65.85142037667065]
We introduce Powerpropagation, a new weight parameterisation for neural networks that leads to inherently sparse models.
Models trained in this manner exhibit similar performance, but have a distribution with markedly higher density at zero, allowing more parameters to be pruned safely.
Here, we combine Powerpropagation with a traditional weight-pruning technique as well as recent state-of-the-art sparse-to-sparse algorithms, showing superior performance on the ImageNet benchmark.
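The reparameterisation at the core of Powerpropagation can be sketched compactly: each weight is stored as an underlying parameter v and materialised as v * |v|^(alpha - 1), so the gradient with respect to v is scaled by its own magnitude and the learned distribution concentrates near zero, making magnitude pruning safer. The module below is a hedged sketch under that assumption (alpha = 2 and the initialisation are illustrative), not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PowerpropLinear(nn.Module):
    """Linear layer with a Powerpropagation-style weight reparameterisation.

    The effective weight is v * |v|**(alpha - 1); for alpha > 1 the gradient
    with respect to v is scaled by the weight's own magnitude, so small
    weights change slowly and the trained distribution piles up near zero.
    """

    def __init__(self, in_features: int, out_features: int, alpha: float = 2.0):
        super().__init__()
        self.alpha = alpha
        self.v = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features))

    def effective_weight(self) -> torch.Tensor:
        return self.v * self.v.abs().pow(self.alpha - 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, self.effective_weight(), self.bias)
```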
arXiv Detail & Related papers (2021-10-01T10:03:57Z)
- Provable Benefits of Overparameterization in Model Compression: From Double Descent to Pruning Neural Networks [38.153825455980645]
Recent empirical evidence indicates that the practice of overparameterization not only benefits training large models, but also assists - perhaps counterintuitively - in building lightweight models.
This paper sheds light on these empirical findings by theoretically characterizing the high-dimensional asymptotics of model pruning.
We analytically identify regimes in which, even if the location of the most informative features is known, we are better off fitting a large model and then pruning.
arXiv Detail & Related papers (2020-12-16T05:13:30Z)
- Manifold attack [0.22419496088582863]
In this paper, we enforce manifold preservation (manifold learning) from the original data onto the latent representation.
We show that our regularization approach improves accuracy and robustness to adversarial examples.
arXiv Detail & Related papers (2020-09-13T09:39:32Z)