Sparse Progressive Distillation: Resolving Overfitting under
Pretrain-and-Finetune Paradigm
- URL: http://arxiv.org/abs/2110.08190v2
- Date: Mon, 18 Oct 2021 19:56:35 GMT
- Title: Sparse Progressive Distillation: Resolving Overfitting under
Pretrain-and-Finetune Paradigm
- Authors: Shaoyi Huang, Dongkuan Xu, Ian E.H. Yen, Sung-en Chang, Bingbing Li,
Shiyang Chen, Mimi Xie, Hang Liu, Caiwen Ding
- Abstract summary: Various pruning approaches have been proposed to reduce the footprint requirements of Transformer-based language models.
We show for the first time that reducing the risk of overfitting can improve the effectiveness of pruning under the pretrain-and-finetune paradigm.
- Score: 7.662952656290564
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Various pruning approaches have been proposed to reduce the footprint
requirements of Transformer-based language models. Conventional wisdom is that
pruning reduces the model expressiveness and thus is more likely to underfit
than overfit compared to the original model. However, under the trending
pretrain-and-finetune paradigm, we argue that pruning increases the risk of
overfitting if pruning is performed during the fine-tuning phase, as it increases
the amount of information a model needs to learn from the downstream task,
resulting in relative data deficiency. In this paper, we aim to address the
overfitting issue under the pretrain-and-finetune paradigm to improve pruning
performance via progressive knowledge distillation (KD) and sparse pruning.
Furthermore, to mitigate the interference among the learning-rate, pruning, and
distillation strategies, we propose a three-stage learning
framework. We show for the first time that reducing the risk of overfitting can
improve the effectiveness of pruning under the pretrain-and-finetune paradigm.
Experiments on multiple datasets from the GLUE benchmark show that our method
achieves highly competitive pruning performance compared with state-of-the-art
competitors across different pruning-ratio constraints.
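To make the interplay of progressive sparse pruning and knowledge distillation more concrete, here is a minimal, hypothetical sketch (not the authors' released implementation): a magnitude-pruning mask whose target sparsity grows over training, combined with a task loss plus a temperature-scaled KD loss from the dense teacher. The cubic schedule shape, the loss weighting `alpha`, and the helper names are assumptions.

```python
# Hypothetical sketch of progressive sparse pruning + knowledge distillation.
# Not the paper's code; the cubic schedule and loss weighting are assumptions.
import torch
import torch.nn.functional as F


def target_sparsity(step, total_steps, final_sparsity=0.9):
    """Cubic schedule: sparsity rises smoothly from 0 to final_sparsity."""
    frac = min(step / total_steps, 1.0)
    return final_sparsity * (1.0 - (1.0 - frac) ** 3)


def magnitude_mask(weight, sparsity):
    """Binary mask keeping the largest-magnitude weights at the given sparsity."""
    keep = max(int(weight.numel() * (1.0 - sparsity)), 1)
    threshold = torch.topk(weight.abs().flatten(), keep).values.min()
    return (weight.abs() >= threshold).float()


def distillation_step(student, teacher, inputs, labels, step, total_steps,
                      temperature=2.0, alpha=0.5):
    """One training step: prune the student progressively, then mix task and KD losses."""
    sparsity = target_sparsity(step, total_steps)
    for module in student.modules():
        if isinstance(module, torch.nn.Linear):
            module.weight.data.mul_(magnitude_mask(module.weight.data, sparsity))

    student_logits = student(inputs)
    with torch.no_grad():
        teacher_logits = teacher(inputs)

    task_loss = F.cross_entropy(student_logits, labels)
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * task_loss + (1.0 - alpha) * kd_loss
```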
Related papers
- Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization [77.62516752323207]
We introduce an orthogonal fine-tuning method for efficiently fine-tuning pretrained weights and enabling enhanced robustness and generalization.
A self-regularization strategy is further exploited to maintain stability in terms of the zero-shot generalization of VLMs; the resulting method is dubbed OrthSR.
For the first time, we revisit CLIP and CoOp with our method to effectively improve the models in few-shot image classification scenarios.
arXiv Detail & Related papers (2024-07-11T10:35:53Z)
- Gradient-based Intra-attention Pruning on Pre-trained Language Models [21.444503777215637]
We propose a structured pruning method, GRAIN (Gradient-based Intra-attention pruning).
GRAIN inspects and prunes intra-attention structures, which greatly expands the structure search space and enables more flexible models.
Experiments on GLUE, SQuAD, and CoNLL 2003 show that GRAIN notably outperforms other methods, especially in the high sparsity regime.
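The GRAIN summary above mentions pruning intra-attention structures using gradient information; the snippet below is a generic sketch of that family of ideas (not GRAIN's actual implementation): each attention head is scored by the summed |weight × gradient| of its output-projection slice, and the lowest-scoring heads are masked. The slicing convention and the `prune_ratio` value are assumptions.

```python
# Generic gradient-based head-importance sketch (an assumption; not GRAIN's code).
import torch


def head_importance(out_proj_weight, out_proj_grad, num_heads):
    """Score each head by the summed |weight * grad| over its slice of W_O."""
    hidden = out_proj_weight.shape[1]
    head_dim = hidden // num_heads
    scores = (out_proj_weight * out_proj_grad).abs()
    # Columns of W_O are grouped per head in this (assumed) layout.
    return scores.view(-1, num_heads, head_dim).sum(dim=(0, 2))


def head_mask(scores, prune_ratio=0.3):
    """0/1 mask over heads: drop the lowest-importance fraction."""
    num_prune = int(scores.numel() * prune_ratio)
    mask = torch.ones_like(scores)
    if num_prune > 0:
        _, lowest = torch.topk(scores, num_prune, largest=False)
        mask[lowest] = 0.0
    return mask
```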
arXiv Detail & Related papers (2022-12-15T06:52:31Z)
- Incremental Prototype Prompt-tuning with Pre-trained Representation for Class Incremental Learning [4.717066668969749]
Class incremental learning has attracted much attention, but most existing works still continually fine-tune the representation model.
We take the pre-train-and-prompt-tuning paradigm to sequentially learn new visual concepts based on a fixed, semantically rich pre-trained representation model.
Our method consistently outperforms other state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-04-07T12:49:14Z)
- Prospect Pruning: Finding Trainable Weights at Initialization using Meta-Gradients [36.078414964088196]
Pruning neural networks at initialization would enable us to find sparse models that retain the accuracy of the original network.
Current methods are insufficient to enable this optimization and lead to a large degradation in model performance.
We propose Prospect Pruning (ProsPr), which uses meta-gradients through the first few steps of optimization to determine which weights to prune.
Our method achieves state-of-the-art pruning performance on a variety of vision classification tasks, with less data and in a single shot compared to existing pruning-at-initialization methods.
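As one way to picture the "meta-gradients through the first few steps" idea, the toy sketch below (my interpretation, not ProsPr's actual procedure) unrolls a few differentiable SGD steps on a single linear layer and scores each initial weight by |w × ∂(final loss)/∂w|; the single-layer setting and the scoring rule are simplifying assumptions.

```python
# Toy meta-gradient pruning-score sketch (an interpretation, not ProsPr's code).
import torch
import torch.nn.functional as F


def meta_gradient_scores(w0, batches, labels, lr=0.1):
    """w0: (in_features, num_classes) weights of a single linear classifier."""
    w_init = w0.clone().requires_grad_(True)
    w_t = w_init
    # Unroll a few inner SGD steps, keeping the graph so gradients reach w_init.
    for x, y in zip(batches[:-1], labels[:-1]):
        loss = F.cross_entropy(x @ w_t, y)
        (grad,) = torch.autograd.grad(loss, w_t, create_graph=True)
        w_t = w_t - lr * grad
    # Meta-gradient: sensitivity of the post-update loss to the initial weights.
    final_loss = F.cross_entropy(batches[-1] @ w_t, labels[-1])
    (meta_grad,) = torch.autograd.grad(final_loss, w_init)
    return (w0 * meta_grad).abs()  # higher score = keep; prune the lowest scores
```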
arXiv Detail & Related papers (2022-02-16T15:18:55Z)
- Regularizing Variational Autoencoder with Diversity and Uncertainty Awareness [61.827054365139645]
Variational Autoencoder (VAE) approximates the posterior of latent variables based on amortized variational inference.
We propose an alternative model, DU-VAE, for learning a more Diverse and less Uncertain latent space.
arXiv Detail & Related papers (2021-10-24T07:58:13Z)
- Unleashing the Power of Contrastive Self-Supervised Visual Models via Contrast-Regularized Fine-Tuning [94.35586521144117]
We investigate whether applying contrastive learning to fine-tuning would bring further benefits.
We propose Contrast-regularized tuning (Core-tuning), a novel approach for fine-tuning contrastive self-supervised visual models.
arXiv Detail & Related papers (2021-02-12T16:31:24Z)
- A Gradient Flow Framework For Analyzing Network Pruning [11.247894240593693]
Recent network pruning methods focus on pruning models early-on in training.
We develop a general framework that uses gradient flow to unify importance measures through the norm of model parameters.
We validate our claims on several VGG-13, MobileNet-V1, and ResNet-56 models trained on CIFAR-10/CIFAR-100.
arXiv Detail & Related papers (2020-09-24T17:37:32Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
- Movement Pruning: Adaptive Sparsity by Fine-Tuning [115.91907953454034]
Magnitude pruning is a widely used strategy for reducing model size in pure supervised learning.
We propose the use of movement pruning, a simple, deterministic first-order weight pruning method.
Experiments show that when pruning large pretrained language models, movement pruning shows significant improvements in high-sparsity regimes.
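The movement-pruning summary above describes a first-order criterion; a minimal sketch of that scoring rule (simplified from my reading of the idea, not the paper's implementation) accumulates S ← S − grad × weight during fine-tuning, so weights that are moving toward zero receive low scores and are pruned first.

```python
# Minimal movement-pruning score sketch (simplified; not the paper's implementation).
import torch


class MovementScore:
    """Track a per-weight score S; S decreases when a weight moves toward zero."""

    def __init__(self, weight):
        self.score = torch.zeros_like(weight)

    def update(self, weight, grad):
        # First-order movement signal: -dL/dW * W, accumulated over fine-tuning steps.
        self.score -= grad * weight

    def mask(self, sparsity):
        """Keep the top-(1 - sparsity) fraction of weights by accumulated score."""
        keep = max(int(self.score.numel() * (1.0 - sparsity)), 1)
        threshold = torch.topk(self.score.flatten(), keep).values.min()
        return (self.score >= threshold).float()
```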
arXiv Detail & Related papers (2020-05-15T17:54:15Z)
- Learnable Bernoulli Dropout for Bayesian Deep Learning [53.79615543862426]
Learnable Bernoulli dropout (LBD) is a new model-agnostic dropout scheme that considers the dropout rates as parameters jointly optimized with other model parameters.
LBD leads to improved accuracy and uncertainty estimates in image classification and semantic segmentation.
arXiv Detail & Related papers (2020-02-12T18:57:14Z)
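For the learnable-dropout entry above, one common way to make a Bernoulli dropout rate trainable is a relaxed (Concrete-style) Bernoulli mask; the sketch below follows that relaxation as an assumption about the mechanism and is not the paper's exact gradient estimator.

```python
# Sketch of a dropout layer with a learnable rate via a relaxed Bernoulli mask.
# The Concrete-style relaxation is an assumption, not LBD's exact estimator.
import torch
import torch.nn as nn


class LearnableDropout(nn.Module):
    def __init__(self, init_rate=0.1, temperature=0.1):
        super().__init__()
        # Parameterize the drop rate in logit space so it stays in (0, 1).
        self.logit_p = nn.Parameter(torch.logit(torch.tensor(init_rate)))
        self.temperature = temperature

    def forward(self, x):
        p = torch.sigmoid(self.logit_p)
        if not self.training:
            return x  # inverted-dropout scaling below keeps expectations matched
        u = torch.rand_like(x).clamp(1e-6, 1.0 - 1e-6)
        # Relaxed Bernoulli keep-mask: as temperature -> 0 this approaches a hard
        # mask with P(keep) = 1 - p, and gradients flow into logit_p.
        keep_logits = (torch.log1p(-p) - torch.log(p)
                       + torch.log(u) - torch.log1p(-u)) / self.temperature
        keep = torch.sigmoid(keep_logits)
        return x * keep / (1.0 - p)
```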