SEVEN: Pruning Transformer Model by Reserving Sentinels
- URL: http://arxiv.org/abs/2403.12688v1
- Date: Tue, 19 Mar 2024 12:47:43 GMT
- Title: SEVEN: Pruning Transformer Model by Reserving Sentinels
- Authors: Jinying Xiao, Ping Li, Jie Nie, Zhe Tang
- Abstract summary: Symbolic Descent (SD) is a general approach for training and fine-tuning Transformer models (TM).
We introduce SEVEN, which particularly favors weights with consistently high sensitivity, i.e., weights with small gradient noise.
The results demonstrate significant improvements of SEVEN in multiple pruning scenarios and across different sparsity levels.
- Score: 18.535687216213628
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale Transformer models (TM) have demonstrated outstanding performance across various tasks. However, their considerable parameter size restricts their applicability, particularly on mobile devices. Due to the dynamic and intricate nature of gradients on TM compared to Convolutional Neural Networks, commonly used pruning methods tend to retain weights with larger gradient noise. This results in pruned models that are sensitive to sparsity and datasets, exhibiting suboptimal performance. Symbolic Descent (SD) is a general approach for training and fine-tuning TM. In this paper, we describe the noisy batch gradient sequences on TM through the cumulative process of SD and use this formulation to dynamically assess the importance scores of weights. We introduce SEVEN, which particularly favors weights with consistently high sensitivity, i.e., weights with small gradient noise, and tends to preserve these weights. Extensive experiments on various TM in natural language, question-answering, and image classification domains validate the effectiveness of SEVEN. The results demonstrate significant improvements from SEVEN in multiple pruning scenarios and across different sparsity levels. Additionally, SEVEN exhibits robust performance under various fine-tuning strategies. The code is publicly available at https://github.com/xiaojinying/SEVEN.
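The abstract states the idea (favor weights whose sensitivity is high and stable) but not the update rule. Below is a minimal, hedged sketch of that general idea; the |w * grad| sensitivity proxy, the exponential-moving-average form, and all hyper-parameters are illustrative assumptions, not the released SEVEN implementation (see the linked repository for that).

```python
import torch

class NoiseAwareImportance:
    """Minimal sketch of a noise-aware pruning score (not the official SEVEN code).

    Per weight, keep an exponential moving average (EMA) of the batch
    sensitivity |w * grad| and an EMA of how much that sensitivity fluctuates.
    Weights with high average sensitivity and small fluctuation (small gradient
    noise) receive the largest scores and are the ones a pruner would keep.
    """

    def __init__(self, param: torch.Tensor, beta1: float = 0.9, beta2: float = 0.9):
        self.beta1, self.beta2 = beta1, beta2
        self.mean = torch.zeros_like(param)   # EMA of sensitivity
        self.noise = torch.zeros_like(param)  # EMA of its fluctuation

    @torch.no_grad()
    def update(self, param: torch.Tensor, grad: torch.Tensor) -> None:
        sens = (param * grad).abs()
        self.noise = self.beta2 * self.noise + (1 - self.beta2) * (sens - self.mean).abs()
        self.mean = self.beta1 * self.mean + (1 - self.beta1) * sens

    def score(self, eps: float = 1e-8) -> torch.Tensor:
        # High average sensitivity with low gradient noise -> keep the weight.
        return self.mean / (self.noise + eps)


def keep_mask(score: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Binary mask keeping the (1 - sparsity) fraction of highest-scoring weights."""
    k = max(1, int(score.numel() * (1.0 - sparsity)))
    threshold = torch.topk(score.flatten(), k).values[-1]
    return (score >= threshold).to(score.dtype)
```

In a fine-tuning loop, update() would be called after each backward pass and keep_mask() applied whenever the pruning schedule fires.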
Related papers
- Understanding the robustness difference between stochastic gradient descent and adaptive gradient methods [11.895321856533934]
Stochastic gradient descent (SGD) and adaptive gradient methods have been widely used in training deep neural networks.
We empirically show that while the difference in standard generalization performance between models trained with these methods is small, those trained using SGD exhibit far greater robustness under input perturbations.
arXiv Detail & Related papers (2023-08-13T07:03:22Z)
- Sensi-BERT: Towards Sensitivity Driven Fine-Tuning for Parameter-Efficient BERT [6.029590006321152]
We present Sensi-BERT, a sensitivity-driven efficient fine-tuning scheme for BERT models on downstream tasks.
Our experiments show the efficacy of Sensi-BERT across different downstream tasks including MNLI, QQP, QNLI, SST-2 and SQuAD.
arXiv Detail & Related papers (2023-07-14T17:24:15Z)
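The Sensi-BERT summary above only names the criterion ("sensitivity driven"). As a rough illustration of what sensitivity-driven selection can look like, the sketch below scores each parameter tensor by the mean of |w * grad| on a calibration batch and fine-tunes only the top fraction; the actual Sensi-BERT score and the granularity at which it trims parameters may differ.

```python
import torch

def sensitivity_ranking(model: torch.nn.Module, loss: torch.Tensor) -> list:
    """Rank parameter tensors by a simple sensitivity proxy, mean |w * dL/dw|."""
    model.zero_grad()
    loss.backward()
    scores = []
    for name, p in model.named_parameters():
        if p.grad is not None:
            scores.append((name, (p.detach() * p.grad).abs().mean().item()))
    return sorted(scores, key=lambda kv: kv[1], reverse=True)


def tune_most_sensitive(model: torch.nn.Module, ranking: list, keep_fraction: float = 0.3) -> None:
    """Freeze everything except the most sensitive fraction of parameter tensors."""
    keep = {name for name, _ in ranking[: int(len(ranking) * keep_fraction)]}
    for name, p in model.named_parameters():
        p.requires_grad_(name in keep)
```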
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose WTA-CRS, a new family of unbiased estimators for matrix production with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
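The WTA-CRS summary above refers to unbiased, reduced-variance estimation of matrix products. The sketch below shows only the plain column-row sampling (CRS) estimator that such methods build on; the winner-take-all selection that gives WTA-CRS its name and its additional variance reduction is not reproduced here.

```python
import torch

def crs_matmul(A: torch.Tensor, B: torch.Tensor, k: int) -> torch.Tensor:
    """Plain column-row sampling estimator of A @ B.

    Samples k column-row pairs with probability proportional to
    ||A[:, i]|| * ||B[i, :]|| and reweights them so the estimate is unbiased.
    """
    col_norms = A.norm(dim=0)          # ||A[:, i]||
    row_norms = B.norm(dim=1)          # ||B[i, :]||
    probs = col_norms * row_norms
    probs = probs / probs.sum()
    idx = torch.multinomial(probs, k, replacement=True)
    scale = 1.0 / (k * probs[idx])     # importance weights for unbiasedness
    return (A[:, idx] * scale) @ B[idx, :]


# Sanity check: the estimate approaches the exact product as k grows.
A, B = torch.randn(64, 256), torch.randn(256, 32)
approx = crs_matmul(A, B, k=128)
print((approx - A @ B).norm() / (A @ B).norm())
```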
- Demystify Transformers & Convolutions in Modern Image Deep Networks [82.32018252867277]
This paper aims to identify the real gains of popular convolution and attention operators through a detailed study.
We find that the key difference among these feature transformation modules, such as attention or convolution, lies in their spatial feature aggregation approach.
Our experiments on various tasks and an analysis of inductive bias show a significant performance boost due to advanced network-level and block-level designs.
arXiv Detail & Related papers (2022-11-10T18:59:43Z)
- Improving Robustness with Adaptive Weight Decay [8.096469295357737]
We propose adaptive weight decay, which automatically tunes the hyper-parameter for weight decay at each training iteration.
We show that this simple modification can result in large improvements in robustness.
This method has other desirable properties, such as less sensitivity to learning rate, and smaller weight norms.
arXiv Detail & Related papers (2022-09-30T21:13:00Z)
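The adaptive weight decay summary above says the decay coefficient is re-tuned at every training iteration. The sketch below shows one plausible form of that idea, scaling a base coefficient by the gradient-to-weight norm ratio; the paper's exact rule may differ.

```python
import torch

def sgd_step_with_adaptive_weight_decay(params, lr: float = 0.1, lam: float = 5e-4) -> None:
    """One plain SGD step where the weight-decay coefficient is re-tuned per iteration.

    Here the base coefficient `lam` is scaled by the gradient-to-weight norm ratio,
    so decay is strong when gradients are large relative to the weights. This is an
    illustrative choice, not necessarily the paper's exact formulation.
    """
    with torch.no_grad():
        for p in params:
            if p.grad is None:
                continue
            g = p.grad
            lam_t = lam * g.norm() / (p.norm() + 1e-12)  # adapted every step
            p.add_(g + lam_t * p, alpha=-lr)
```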
- PLATON: Pruning Large Transformer Models with Upper Confidence Bound of Weight Importance [114.1541203743303]
We propose PLATON, which captures the uncertainty of importance scores by upper confidence bound (UCB) of importance estimation.
We conduct extensive experiments with several Transformer-based models on natural language understanding, question answering and image classification.
arXiv Detail & Related papers (2022-06-25T05:38:39Z)
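For the PLATON summary above, here is a toy sketch of an upper-confidence-bound-style score that combines smoothed importance with smoothed uncertainty, so weights whose importance is still uncertain are not pruned too early. Note the contrast with the SEVEN sketch earlier, where low noise is rewarded rather than multiplied in. Hyper-parameters and details here are illustrative, not the paper's.

```python
import torch

class UCBImportance:
    """Toy UCB-style importance tracker in the spirit of the PLATON summary."""

    def __init__(self, param: torch.Tensor, beta1: float = 0.85, beta2: float = 0.95):
        self.beta1, self.beta2 = beta1, beta2
        self.imp = torch.zeros_like(param)   # EMA of importance |w * grad|
        self.unc = torch.zeros_like(param)   # EMA of its deviation

    @torch.no_grad()
    def update(self, param: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
        imp_now = (param * grad).abs()
        self.unc = self.beta2 * self.unc + (1 - self.beta2) * (imp_now - self.imp).abs()
        self.imp = self.beta1 * self.imp + (1 - self.beta1) * imp_now
        # Prune the weights with the smallest importance * uncertainty scores.
        return self.imp * self.unc
```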
- Dynamic Collective Intelligence Learning: Finding Efficient Sparse Model via Refined Gradients for Pruned Weights [31.68257673664519]
Dynamic pruning methods try to find diverse sparsity patterns during training by utilizing Straight-Through-Estimator (STE) to approximate gradients of pruned weights.
We introduce refined gradients to update the pruned weights by forming dual forwarding paths from two sets (pruned and unpruned) of weights.
We propose a novel Dynamic Collective Intelligence Learning (DCIL) which makes use of the learning synergy between the collective intelligence of both weight sets.
arXiv Detail & Related papers (2021-09-10T04:41:17Z)
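For the DCIL summary above, the sketch below is a toy layer with dual forwarding paths, so that pruned weights can receive gradients of their own from a dense path instead of STE approximations. How the two paths are actually coupled and scheduled in DCIL is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualPathLinear(nn.Module):
    """Toy dual-forwarding linear layer, loosely in the spirit of the DCIL summary."""

    def __init__(self, in_features: int, out_features: int, sparsity: float = 0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.register_buffer("mask", torch.ones_like(self.weight))
        self.sparsity = sparsity

    def update_mask(self) -> None:
        # Magnitude-based mask, recomputed periodically during training.
        k = max(1, int(self.weight.numel() * (1 - self.sparsity)))
        thr = torch.topk(self.weight.abs().flatten(), k).values.min()
        self.mask = (self.weight.abs() >= thr).float()

    def forward(self, x: torch.Tensor):
        sparse_out = F.linear(x, self.weight * self.mask)  # pruned path
        dense_out = F.linear(x, self.weight)               # dense path
        return sparse_out, dense_out
```

During training one would compute the task loss on both outputs, back-propagate a (weighted) sum so pruned weights keep receiving meaningful gradients, and call update_mask() on a schedule.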
- AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights [53.8489656709356]
Normalization techniques are a boon for modern deep learning.
It is often overlooked, however, that the additional introduction of momentum results in a rapid reduction in effective step sizes for scale-invariant weights.
In this paper, we verify that the widely adopted combination of the two ingredients leads to the premature decay of effective step sizes and sub-optimal model performance.
arXiv Detail & Related papers (2020-06-15T08:35:15Z)
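The AdamP summary above attributes the slowdown to momentum inflating the norm of scale-invariant weights, which shrinks their effective step size. The sketch below shows only the corrective idea of removing the radial (norm-growing) component of an update for such weights, not the full optimizer.

```python
import torch

def project_out_radial(update: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """Strip the radial component of an update for a scale-invariant weight.

    For weights followed by normalization layers, only the tangential part of the
    update changes the function; the radial part merely inflates the weight norm
    and thereby shrinks the effective step size over time.
    """
    w = weight.flatten()
    u = update.flatten()
    w_hat = w / (w.norm() + 1e-12)
    u_tangent = u - torch.dot(w_hat, u) * w_hat
    return u_tangent.view_as(update)
```

Inside an Adam-like step, this projection would be applied to the update of weights detected as (approximately) scale-invariant.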
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
- Dynamic Scale Training for Object Detection [111.33112051962514]
We propose a Dynamic Scale Training paradigm (abbreviated as DST) to mitigate scale variation challenge in object detection.
Experimental results demonstrate the efficacy of our proposed DST towards scale variation handling.
It does not introduce inference overhead and could serve as a free lunch for general detection configurations.
arXiv Detail & Related papers (2020-04-26T16:48:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.