PLATON: Pruning Large Transformer Models with Upper Confidence Bound of
Weight Importance
- URL: http://arxiv.org/abs/2206.12562v1
- Date: Sat, 25 Jun 2022 05:38:39 GMT
- Title: PLATON: Pruning Large Transformer Models with Upper Confidence Bound of
Weight Importance
- Authors: Qingru Zhang, Simiao Zuo, Chen Liang, Alexander Bukharin, Pengcheng
He, Weizhu Chen, Tuo Zhao
- Abstract summary: We propose PLATON, which captures the uncertainty of importance scores by upper confidence bound (UCB) of importance estimation.
We conduct extensive experiments with several Transformer-based models on natural language understanding, question answering and image classification.
- Score: 114.1541203743303
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Transformer-based models have exhibited superior performance in various
natural language processing and computer vision tasks. However, these models
contain enormous amounts of parameters, which restrict their deployment to
real-world applications. To reduce the model size, researchers prune these
models based on the weights' importance scores. However, such scores are
usually estimated on mini-batches during training, which incurs large
variability/uncertainty due to mini-batch sampling and complicated training
dynamics. As a result, some crucial weights could be pruned by commonly used
pruning methods because of such uncertainty, which makes training unstable and
hurts generalization. To resolve this issue, we propose PLATON, which captures
the uncertainty of importance scores by upper confidence bound (UCB) of
importance estimation. In particular, for the weights with low importance
scores but high uncertainty, PLATON tends to retain them and explores their
capacity. We conduct extensive experiments with several Transformer-based
models on natural language understanding, question answering and image
classification to validate the effectiveness of PLATON. Results demonstrate
that PLATON manifests notable improvement under different sparsity levels. Our
code is publicly available at https://github.com/QingruZhang/PLATON.
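Since this listing does not include the paper's formulas, the following is a minimal, hypothetical PyTorch sketch of the idea the abstract describes: estimate per-weight importance on each mini-batch, smooth it with an exponential moving average, track how far mini-batch estimates deviate from that average as an uncertainty term, and rank weights by an importance-times-uncertainty score so that low-importance but high-uncertainty weights tend to be retained. The class name, the |w * grad| sensitivity estimate, the beta hyperparameters, and the exact combination rule are illustrative assumptions rather than the authors' implementation; refer to the linked repository for the official code.

```python
import torch


class UCBImportancePruner:
    """Minimal sketch of UCB-style importance scoring for iterative pruning.

    Assumptions (not taken verbatim from the paper): per-weight importance is
    estimated as |w * grad| on each mini-batch and smoothed with an exponential
    moving average; uncertainty is the smoothed deviation of each mini-batch
    estimate from that average; the pruning score multiplies the two so that
    low-importance but high-uncertainty weights tend to be kept.
    """

    def __init__(self, model, beta1=0.85, beta2=0.95):
        self.model = model
        self.beta1, self.beta2 = beta1, beta2
        self.avg_importance = {}   # smoothed importance per parameter
        self.avg_uncertainty = {}  # smoothed uncertainty per parameter

    @torch.no_grad()
    def update(self):
        """Call after loss.backward() on each mini-batch."""
        for name, p in self.model.named_parameters():
            if p.grad is None:
                continue
            imp = (p * p.grad).abs()  # mini-batch sensitivity estimate
            if name not in self.avg_importance:
                self.avg_importance[name] = imp.clone()
                self.avg_uncertainty[name] = torch.zeros_like(imp)
                continue
            unc = (imp - self.avg_importance[name]).abs()
            self.avg_importance[name] = (
                self.beta1 * self.avg_importance[name] + (1 - self.beta1) * imp
            )
            self.avg_uncertainty[name] = (
                self.beta2 * self.avg_uncertainty[name] + (1 - self.beta2) * unc
            )

    @torch.no_grad()
    def prune(self, sparsity):
        """Zero out the fraction `sparsity` of weights with the lowest score."""
        names = list(self.avg_importance.keys())
        scores = [
            (self.avg_importance[n] * self.avg_uncertainty[n]).flatten()
            for n in names
        ]
        all_scores = torch.cat(scores)
        k = int(sparsity * all_scores.numel())
        if k == 0:
            return
        threshold = torch.kthvalue(all_scores, k).values
        params = dict(self.model.named_parameters())
        for name in names:
            score = self.avg_importance[name] * self.avg_uncertainty[name]
            params[name].masked_fill_(score <= threshold, 0.0)
```

In practice such a pruner would be called once per training step (update after the backward pass, prune according to a sparsity schedule); the schedule itself is omitted here.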
Related papers
- Pre-trained Vision and Language Transformers Are Few-Shot Incremental Learners [19.579098962615795]
Few-Shot Class Incremental Learning (FSCIL) is a task that requires a model to learn new classes incrementally without forgetting when only a few samples for each class are given.
FSCIL encounters two significant challenges: catastrophic forgetting and overfitting.
We argue that large models such as vision and language transformers pre-trained on large datasets can be excellent few-shot incremental learners.
arXiv Detail & Related papers (2024-04-02T17:23:22Z)
- SEVEN: Pruning Transformer Model by Reserving Sentinels [18.535687216213628]
Symbolic Descent (SD) is a general approach for training and fine-tuning Transformer models (TMs).
The authors introduce SEVEN, which particularly favors weights with consistently high sensitivity, i.e., weights with small gradient noise.
The results demonstrate significant improvements of SEVEN in multiple pruning scenarios and across different sparsity levels.
arXiv Detail & Related papers (2024-03-19T12:47:43Z)
- The Impact of Quantization on the Robustness of Transformer-based Text Classifiers [5.281054432963503]
This work is the first to study the effect of quantization on the robustness of NLP models.
We evaluate the impact of quantization on BERT and DistilBERT models in text classification using SST-2, Emotion, and MR datasets.
Our experiments indicate that quantization increases the robustness of the model by 18.80% on average compared to adversarial training.
arXiv Detail & Related papers (2024-03-08T14:55:05Z)
- Retrieval-based Knowledge Transfer: An Effective Approach for Extreme Large Language Model Compression [64.07696663255155]
Large-scale pre-trained language models (LLMs) have demonstrated exceptional performance in various natural language processing (NLP) tasks.
However, the massive size of these models poses huge challenges for their deployment in real-world applications.
We introduce a novel compression paradigm called Retrieval-based Knowledge Transfer (RetriKT) which effectively transfers the knowledge of LLMs to extremely small-scale models.
arXiv Detail & Related papers (2023-10-24T07:58:20Z)
- Uncertainty-aware Parameter-Efficient Self-training for Semi-supervised Language Understanding [38.11411155621616]
We study self-training as one of the predominant semi-supervised learning approaches.
We present UPET, a novel Uncertainty-aware Parameter-Efficient self-Training framework.
We show that UPET achieves a substantial improvement in terms of performance and efficiency.
arXiv Detail & Related papers (2023-10-19T02:18:29Z)
- The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter [113.35761858962522]
This paper studies induced sparse patterns across multiple large pre-trained vision and language transformers.
We propose the existence of essential sparsity, defined by a sharp dropping point beyond which performance declines much faster.
We also find essential sparsity to hold valid for N:M sparsity patterns as well as on modern-scale large language models.
arXiv Detail & Related papers (2023-06-06T15:49:09Z)
- On Robustness of Finetuned Transformer-based NLP Models [11.063628128069736]
We characterize changes between pretrained and finetuned language model representations across layers using two metrics: CKA and STIR.
GPT-2 representations are more robust than BERT and T5 across multiple types of input perturbations.
This study provides valuable insights into perturbation-specific weaknesses of popular Transformer-based models.
arXiv Detail & Related papers (2023-05-23T18:25:18Z)
- FairIF: Boosting Fairness in Deep Learning via Influence Functions with Validation Set Sensitive Attributes [51.02407217197623]
We propose a two-stage training algorithm named FAIRIF.
It minimizes the loss over the reweighted data set, where the sample weights are computed to balance model performance across different demographic groups.
We show that FAIRIF yields models with better fairness-utility trade-offs against various types of bias.
arXiv Detail & Related papers (2022-01-15T05:14:48Z)
- Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers [94.43313684188819]
We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute.
We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps.
This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models.
arXiv Detail & Related papers (2020-02-26T21:17:13Z)
- MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers [117.67424061746247]
We present a simple and effective approach to compress large Transformer based pre-trained models.
We propose distilling the self-attention module of the last Transformer layer of the teacher, which is effective and flexible for the student.
Experimental results demonstrate that our monolingual model outperforms state-of-the-art baselines across different student model sizes (see the attention-distillation sketch after this list).
arXiv Detail & Related papers (2020-02-25T15:21:10Z)
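The MiniLM entry above mentions distilling the self-attention module of the teacher's last Transformer layer. Below is a minimal, hypothetical PyTorch sketch of one way such attention-based distillation can be expressed, as a KL divergence between teacher and student attention distributions. The function name, the assumption of matching attention tensor shapes, and the omission of any additional distillation terms are illustrative choices, not the paper's complete objective.

```python
import torch


def attention_distillation_loss(teacher_attn, student_attn, eps=1e-8):
    """KL divergence between teacher and student self-attention distributions.

    teacher_attn, student_attn: [batch, heads, seq_len, seq_len] attention
    probabilities from the last Transformer layer, assumed to sum to 1 over
    the final dimension and to share the same shape. Illustrative sketch only.
    """
    # KL(teacher || student), averaged over batch, heads, and query positions
    kl = teacher_attn * (
        torch.log(teacher_attn + eps) - torch.log(student_attn + eps)
    )
    return kl.sum(dim=-1).mean()
```

In a distillation setup this term would typically be added to the student's training loss; the paper's full method may include further components not shown here.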