Sparse Progressive Distillation: Resolving Overfitting under
  Pretrain-and-Finetune Paradigm
- URL: http://arxiv.org/abs/2110.08190v2
- Date: Mon, 18 Oct 2021 19:56:35 GMT
- Title: Sparse Progressive Distillation: Resolving Overfitting under
  Pretrain-and-Finetune Paradigm
- Authors: Shaoyi Huang, Dongkuan Xu, Ian E.H. Yen, Sung-en Chang, Bingbing Li,
  Shiyang Chen, Mimi Xie, Hang Liu, Caiwen Ding
- Abstract summary: Various pruning approaches have been proposed to reduce the footprint requirements of Transformer-based language models.
We show for the first time that reducing the risk of overfitting can help the effectiveness of pruning under the pretrain-and-finetune paradigm.
- Score: 7.662952656290564
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract:   Various pruning approaches have been proposed to reduce the footprint
requirements of Transformer-based language models. Conventional wisdom is that
pruning reduces the model expressiveness and thus is more likely to underfit
than overfit compared to the original model. However, under the trending
pretrain-and-finetune paradigm, we argue that pruning increases the risk of
overfitting if pruning was performed at the fine-tuning phase, as it increases
the amount of information a model needs to learn from the downstream task,
resulting in relative data deficiency. In this paper, we aim to address the
overfitting issue under the pretrain-and-finetune paradigm to improve pruning
performance via progressive knowledge distillation (KD) and sparse pruning.
Furthermore, to mitigate the interference between different strategies of
learning rate, pruning and distillation, we propose a three-stage learning
framework. We show for the first time that reducing the risk of overfitting can
help the effectiveness of pruning under the pretrain-and-finetune paradigm.
Experiments on multiple datasets of GLUE benchmark show that our method
achieves highly competitive pruning performance over the state-of-the-art
competitors across different pruning ratio constraints.
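The abstract names three ingredients: knowledge distillation, sparse pruning, and a sparsity level that grows in stages. The sketch below illustrates generic versions of each in plain Python. It is not the authors' implementation. The function names, the temperature-scaled KL distillation loss, the magnitude-based pruning criterion, and the cubic sparsity ramp are all assumptions chosen for illustration; the paper's actual criteria and schedules may differ.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    This is the commonly used distillation loss; it is assumed here,
    not taken from the paper.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that drops the `sparsity` fraction of weights
    with the smallest absolute value (hypothetical pruning criterion)."""
    n_prune = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]

def sparsity_schedule(step, total_steps, final_sparsity):
    """Cubic ramp from 0 to `final_sparsity` (an assumed schedule)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return final_sparsity * (1.0 - (1.0 - frac) ** 3)

# Toy usage: prune a small weight vector progressively and report the
# distillation loss between fixed example logits.
weights = [0.5, -0.1, 0.9, 0.05]
for step in (0, 5, 10):
    s = sparsity_schedule(step, 10, 0.5)
    print(step, round(s, 3), magnitude_mask(weights, s))
print(kd_loss([1.0, 0.0, -1.0], [2.0, 0.5, -1.5]))
```

In a real fine-tuning loop the mask would be applied to the student's weights after each update, and the distillation loss would be combined with the task loss; how those pieces are weighted and staged is exactly what the paper's three-stage framework addresses.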
 
      
Related papers
        - Improved Methods for Model Pruning and Knowledge Distillation [3.8993503758122663]
 MAMA Pruning is a performance optimization technique for large language models like R1 or o3-mini.
It effectively reduces model size and computational complexity while maintaining performance comparable to the original unpruned model, even at extreme pruning levels.
Preliminary experimental results show that our method outperforms or is comparable to state-of-the-art methods across various pruning levels and different downstream computational linguistics tasks.
 arXiv  Detail & Related papers  (2025-05-20T07:53:40Z)
- IDEA Prune: An Integrated Enlarge-and-Prune Pipeline in Generative Language Model Pretraining [50.53912352342753]
 We propose an integrated enlarge-and-prune pipeline, which combines enlarge model training, pruning, and recovery.
We conduct experiments on compressing 2.8B models to 1.3B with up to 2T tokens in pretraining.
It demonstrates the integrated approach not only provides insights into the token efficiency of enlarged model pretraining but also achieves superior performance of pruned models.
 arXiv  Detail & Related papers  (2025-03-07T20:35:31Z)
- Pruning for Sparse Diffusion Models based on Gradient Flow [5.45577871017303]
 Diffusion Models (DMs) have impressive capabilities among generation models, but are limited by slow inference speeds and high computational costs.
Previous works utilize one-shot structure pruning to derive lightweight DMs from pre-trained ones, but this approach often leads to a significant drop in generation quality.
We propose an iterative pruning method based on gradient flow, including the gradient flow pruning process and the gradient flow pruning criterion.
 arXiv  Detail & Related papers  (2025-01-16T10:55:05Z)
- Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization [77.62516752323207]
 We introduce an orthogonal fine-tuning method for efficiently fine-tuning pretrained weights and enabling enhanced robustness and generalization.
A self-regularization strategy is further exploited to maintain the stability in terms of zero-shot generalization of VLMs, dubbed OrthSR.
For the first time, we revisit CLIP and CoOp with our method to effectively improve the model in the few-shot image classification scenario.
 arXiv  Detail & Related papers  (2024-07-11T10:35:53Z)
- Gradient-based Intra-attention Pruning on Pre-trained Language Models [21.444503777215637]
 We propose a structured pruning method, GRAIN (Gradient-based Intra-attention pruning).
GRAIN inspects and prunes intra-attention structures, which greatly expands the structure search space and enables more flexible models.
Experiments on GLUE, SQuAD, and CoNLL 2003 show that GRAIN notably outperforms other methods, especially in the high sparsity regime.
 arXiv  Detail & Related papers  (2022-12-15T06:52:31Z)
- Incremental Prototype Prompt-tuning with Pre-trained Representation for
  Class Incremental Learning [4.717066668969749]
 Class incremental learning has attracted much attention, but most existing works still continually fine-tune the representation model.
We take the pre-train-and-prompt-tuning paradigm to sequentially learn new visual concepts based on a fixed semantic rich pre-trained representation model.
Our method consistently outperforms other state-of-the-art methods with a large margin.
 arXiv  Detail & Related papers  (2022-04-07T12:49:14Z)
- Prospect Pruning: Finding Trainable Weights at Initialization using
  Meta-Gradients [36.078414964088196]
 Pruning neural networks at initialization would enable us to find sparse models that retain the accuracy of the original network.
Current methods are insufficient to enable this optimization and lead to a large degradation in model performance.
We propose Prospect Pruning (ProsPr), which uses meta-gradients through the first few steps of optimization to determine which weights to prune.
Our method achieves state-of-the-art pruning performance on a variety of vision classification tasks, with less data and in a single shot compared to existing pruning-at-initialization methods.
 arXiv  Detail & Related papers  (2022-02-16T15:18:55Z)
- Regularizing Variational Autoencoder with Diversity and Uncertainty
  Awareness [61.827054365139645]
 Variational Autoencoder (VAE) approximates the posterior of latent variables based on amortized variational inference.
We propose an alternative model, DU-VAE, for learning a more Diverse and less Uncertain latent space.
 arXiv  Detail & Related papers  (2021-10-24T07:58:13Z)
- Unleashing the Power of Contrastive Self-Supervised Visual Models via
  Contrast-Regularized Fine-Tuning [94.35586521144117]
 We investigate whether applying contrastive learning to fine-tuning would bring further benefits.
We propose Contrast-regularized tuning (Core-tuning), a novel approach for fine-tuning contrastive self-supervised visual models.
 arXiv  Detail & Related papers  (2021-02-12T16:31:24Z)
- A Gradient Flow Framework For Analyzing Network Pruning [11.247894240593693]
 Recent network pruning methods focus on pruning models early-on in training.
We develop a general framework that uses gradient flow to unify importance measures through the norm of model parameters.
We validate our claims on several VGG-13, MobileNet-V1, and ResNet-56 models trained on CIFAR-10/CIFAR-100.
 arXiv  Detail & Related papers  (2020-09-24T17:37:32Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
 We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
 arXiv  Detail & Related papers  (2020-06-10T08:22:41Z)
- Movement Pruning: Adaptive Sparsity by Fine-Tuning [115.91907953454034]
 Magnitude pruning is a widely used strategy for reducing model size in pure supervised learning.
We propose the use of movement pruning, a simple, deterministic first-order weight pruning method.
Experiments show that when pruning large pretrained language models, movement pruning shows significant improvements in high-sparsity regimes.
 arXiv  Detail & Related papers  (2020-05-15T17:54:15Z)
- Learnable Bernoulli Dropout for Bayesian Deep Learning [53.79615543862426]
 Learnable Bernoulli dropout (LBD) is a new model-agnostic dropout scheme that considers the dropout rates as parameters jointly optimized with other model parameters.
LBD leads to improved accuracy and uncertainty estimates in image classification and semantic segmentation.
 arXiv  Detail & Related papers  (2020-02-12T18:57:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
       
     