Sparse Progressive Distillation: Resolving Overfitting under
Pretrain-and-Finetune Paradigm
- URL: http://arxiv.org/abs/2110.08190v2
- Date: Mon, 18 Oct 2021 19:56:35 GMT
- Title: Sparse Progressive Distillation: Resolving Overfitting under
Pretrain-and-Finetune Paradigm
- Authors: Shaoyi Huang, Dongkuan Xu, Ian E.H. Yen, Sung-en Chang, Bingbing Li,
Shiyang Chen, Mimi Xie, Hang Liu, Caiwen Ding
- Abstract summary: Various pruning approaches have been proposed to reduce the footprint requirements of Transformer-based language models.
We show for the first time that reducing the risk of overfitting can improve the effectiveness of pruning under the pretrain-and-finetune paradigm.
- Score: 7.662952656290564
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Various pruning approaches have been proposed to reduce the footprint
requirements of Transformer-based language models. Conventional wisdom is that
pruning reduces the model expressiveness and thus is more likely to underfit
than overfit compared to the original model. However, under the trending
pretrain-and-finetune paradigm, we argue that pruning increases the risk of
overfitting if pruning is performed during the fine-tuning phase, as it increases
the amount of information a model needs to learn from the downstream task,
resulting in relative data deficiency. In this paper, we aim to address the
overfitting issue under the pretrain-and-finetune paradigm to improve pruning
performance via progressive knowledge distillation (KD) and sparse pruning.
Furthermore, to mitigate the interference among the learning-rate, pruning, and
distillation strategies, we propose a three-stage learning
framework. We show for the first time that reducing the risk of overfitting can
improve the effectiveness of pruning under the pretrain-and-finetune paradigm.
Experiments on multiple datasets from the GLUE benchmark show that our method
achieves highly competitive pruning performance compared with state-of-the-art
competitors across different pruning-ratio constraints.
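To make the interplay of progressive sparse pruning and knowledge distillation more concrete, here is a minimal, hypothetical sketch (not the authors' released implementation): a magnitude-pruning mask whose target sparsity grows over training, combined with a task loss plus a temperature-scaled KD loss from the dense teacher. The cubic schedule shape, the loss weighting `alpha`, and the helper names are assumptions.

```python
# Hypothetical sketch of progressive sparse pruning + knowledge distillation.
# Not the paper's code; the cubic schedule and loss weighting are assumptions.
import torch
import torch.nn.functional as F


def target_sparsity(step, total_steps, final_sparsity=0.9):
    """Cubic schedule: sparsity rises smoothly from 0 to final_sparsity."""
    frac = min(step / total_steps, 1.0)
    return final_sparsity * (1.0 - (1.0 - frac) ** 3)


def magnitude_mask(weight, sparsity):
    """Binary mask keeping the largest-magnitude weights at the given sparsity."""
    keep = max(int(weight.numel() * (1.0 - sparsity)), 1)
    threshold = torch.topk(weight.abs().flatten(), keep).values.min()
    return (weight.abs() >= threshold).float()


def distillation_step(student, teacher, inputs, labels, step, total_steps,
                      temperature=2.0, alpha=0.5):
    """One training step: prune the student progressively, then mix task and KD losses."""
    sparsity = target_sparsity(step, total_steps)
    for module in student.modules():
        if isinstance(module, torch.nn.Linear):
            module.weight.data.mul_(magnitude_mask(module.weight.data, sparsity))

    student_logits = student(inputs)
    with torch.no_grad():
        teacher_logits = teacher(inputs)

    task_loss = F.cross_entropy(student_logits, labels)
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * task_loss + (1.0 - alpha) * kd_loss
```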
Related papers
- Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization [77.62516752323207]
We introduce an orthogonal fine-tuning method for efficiently fine-tuning pretrained weights and enabling enhanced robustness and generalization.
A self-regularization strategy is further exploited to maintain stability in terms of the zero-shot generalization of VLMs; the resulting method is dubbed OrthSR.
For the first time, we revisit CLIP and CoOp with our method to effectively improve the models in few-shot image classification scenarios.
arXiv Detail & Related papers (2024-07-11T10:35:53Z)
- Gradient-based Intra-attention Pruning on Pre-trained Language Models [21.444503777215637]
We propose a structured pruning method, GRAIN (Gradient-based Intra-attention pruning).
GRAIN inspects and prunes intra-attention structures, which greatly expands the structure search space and enables more flexible models.
Experiments on GLUE, SQuAD, and CoNLL 2003 show that GRAIN notably outperforms other methods, especially in the high sparsity regime.
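The GRAIN summary above mentions pruning intra-attention structures using gradient information; the snippet below is a generic sketch of that family of ideas (not GRAIN's actual implementation): each attention head is scored by the summed |weight × gradient| of its output-projection slice, and the lowest-scoring heads are masked. The slicing convention and the `prune_ratio` value are assumptions.

```python
# Generic gradient-based head-importance sketch (an assumption; not GRAIN's code).
import torch


def head_importance(out_proj_weight, out_proj_grad, num_heads):
    """Score each head by the summed |weight * grad| over its slice of W_O."""
    hidden = out_proj_weight.shape[1]
    head_dim = hidden // num_heads
    scores = (out_proj_weight * out_proj_grad).abs()
    # Columns of W_O are grouped per head in this (assumed) layout.
    return scores.view(-1, num_heads, head_dim).sum(dim=(0, 2))


def head_mask(scores, prune_ratio=0.3):
    """0/1 mask over heads: drop the lowest-importance fraction."""
    num_prune = int(scores.numel() * prune_ratio)
    mask = torch.ones_like(scores)
    if num_prune > 0:
        _, lowest = torch.topk(scores, num_prune, largest=False)
        mask[lowest] = 0.0
    return mask
```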
arXiv Detail & Related papers (2022-12-15T06:52:31Z)
- Incremental Prototype Prompt-tuning with Pre-trained Representation for Class Incremental Learning [4.717066668969749]
Class incremental learning has attracted much attention, but most existing works still continually fine-tune the representation model.
We take the pre-train-and-prompt-tuning paradigm to sequentially learn new visual concepts based on a fixed, semantically rich pre-trained representation model.
Our method consistently outperforms other state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-04-07T12:49:14Z)
- Prospect Pruning: Finding Trainable Weights at Initialization using Meta-Gradients [36.078414964088196]
Pruning neural networks at initialization would enable us to find sparse models that retain the accuracy of the original network.
Current methods are insufficient to enable this optimization and lead to a large degradation in model performance.
We propose Prospect Pruning (ProsPr), which uses meta-gradients through the first few steps of optimization to determine which weights to prune.
Our method achieves state-of-the-art pruning performance on a variety of vision classification tasks, with less data and in a single shot compared to existing pruning-at-initialization methods.
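As one way to picture the "meta-gradients through the first few steps" idea, the toy sketch below (my interpretation, not ProsPr's actual procedure) unrolls a few differentiable SGD steps on a single linear layer and scores each initial weight by |w × ∂(final loss)/∂w|; the single-layer setting and the scoring rule are simplifying assumptions.

```python
# Toy meta-gradient pruning-score sketch (an interpretation, not ProsPr's code).
import torch
import torch.nn.functional as F


def meta_gradient_scores(w0, batches, labels, lr=0.1):
    """w0: (in_features, num_classes) weights of a single linear classifier."""
    w_init = w0.clone().requires_grad_(True)
    w_t = w_init
    # Unroll a few inner SGD steps, keeping the graph so gradients reach w_init.
    for x, y in zip(batches[:-1], labels[:-1]):
        loss = F.cross_entropy(x @ w_t, y)
        (grad,) = torch.autograd.grad(loss, w_t, create_graph=True)
        w_t = w_t - lr * grad
    # Meta-gradient: sensitivity of the post-update loss to the initial weights.
    final_loss = F.cross_entropy(batches[-1] @ w_t, labels[-1])
    (meta_grad,) = torch.autograd.grad(final_loss, w_init)
    return (w0 * meta_grad).abs()  # higher score = keep; prune the lowest scores
```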
arXiv Detail & Related papers (2022-02-16T15:18:55Z)
- Regularizing Variational Autoencoder with Diversity and Uncertainty Awareness [61.827054365139645]
Variational Autoencoder (VAE) approximates the posterior of latent variables based on amortized variational inference.
We propose an alternative model, DU-VAE, for learning a more Diverse and less Uncertain latent space.
arXiv Detail & Related papers (2021-10-24T07:58:13Z)
- Unleashing the Power of Contrastive Self-Supervised Visual Models via Contrast-Regularized Fine-Tuning [94.35586521144117]
We investigate whether applying contrastive learning to fine-tuning would bring further benefits.
We propose Contrast-regularized tuning (Core-tuning), a novel approach for fine-tuning contrastive self-supervised visual models.
arXiv Detail & Related papers (2021-02-12T16:31:24Z)
- A Gradient Flow Framework For Analyzing Network Pruning [11.247894240593693]
Recent network pruning methods focus on pruning models early-on in training.
We develop a general framework that uses gradient flow to unify importance measures through the norm of model parameters.
We validate our claims on several VGG-13, MobileNet-V1, and ResNet-56 models trained on CIFAR-10/CIFAR-100.
arXiv Detail & Related papers (2020-09-24T17:37:32Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
- Movement Pruning: Adaptive Sparsity by Fine-Tuning [115.91907953454034]
Magnitude pruning is a widely used strategy for reducing model size in pure supervised learning.
We propose the use of movement pruning, a simple, deterministic first-order weight pruning method.
Experiments show that when pruning large pretrained language models, movement pruning shows significant improvements in high-sparsity regimes.
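The movement-pruning summary above describes a first-order criterion; a minimal sketch of that scoring rule (simplified from my reading of the idea, not the paper's implementation) accumulates S ← S − grad × weight during fine-tuning, so weights that are moving toward zero receive low scores and are pruned first.

```python
# Minimal movement-pruning score sketch (simplified; not the paper's implementation).
import torch


class MovementScore:
    """Track a per-weight score S; S decreases when a weight moves toward zero."""

    def __init__(self, weight):
        self.score = torch.zeros_like(weight)

    def update(self, weight, grad):
        # First-order movement signal: -dL/dW * W, accumulated over fine-tuning steps.
        self.score -= grad * weight

    def mask(self, sparsity):
        """Keep the top-(1 - sparsity) fraction of weights by accumulated score."""
        keep = max(int(self.score.numel() * (1.0 - sparsity)), 1)
        threshold = torch.topk(self.score.flatten(), keep).values.min()
        return (self.score >= threshold).float()
```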
arXiv Detail & Related papers (2020-05-15T17:54:15Z)
- Learnable Bernoulli Dropout for Bayesian Deep Learning [53.79615543862426]
Learnable Bernoulli dropout (LBD) is a new model-agnostic dropout scheme that considers the dropout rates as parameters jointly optimized with other model parameters.
LBD leads to improved accuracy and uncertainty estimates in image classification and semantic segmentation.
arXiv Detail & Related papers (2020-02-12T18:57:14Z)
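For the learnable-dropout entry above, one common way to make a Bernoulli dropout rate trainable is a relaxed (Concrete-style) Bernoulli mask; the sketch below follows that relaxation as an assumption about the mechanism and is not the paper's exact gradient estimator.

```python
# Sketch of a dropout layer with a learnable rate via a relaxed Bernoulli mask.
# The Concrete-style relaxation is an assumption, not LBD's exact estimator.
import torch
import torch.nn as nn


class LearnableDropout(nn.Module):
    def __init__(self, init_rate=0.1, temperature=0.1):
        super().__init__()
        # Parameterize the drop rate in logit space so it stays in (0, 1).
        self.logit_p = nn.Parameter(torch.logit(torch.tensor(init_rate)))
        self.temperature = temperature

    def forward(self, x):
        p = torch.sigmoid(self.logit_p)
        if not self.training:
            return x  # inverted-dropout scaling below keeps expectations matched
        u = torch.rand_like(x).clamp(1e-6, 1.0 - 1e-6)
        # Relaxed Bernoulli keep-mask: as temperature -> 0 this approaches a hard
        # mask with P(keep) = 1 - p, and gradients flow into logit_p.
        keep_logits = (torch.log1p(-p) - torch.log(p)
                       + torch.log(u) - torch.log1p(-u)) / self.temperature
        keep = torch.sigmoid(keep_logits)
        return x * keep / (1.0 - p)
```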