From Dense to Sparse: Contrastive Pruning for Better Pre-trained
Language Model Compression
- URL: http://arxiv.org/abs/2112.07198v1
- Date: Tue, 14 Dec 2021 07:14:09 GMT
- Title: From Dense to Sparse: Contrastive Pruning for Better Pre-trained
Language Model Compression
- Authors: Runxin Xu, Fuli Luo, Chengyu Wang, Baobao Chang, Jun Huang, Songfang
Huang, Fei Huang
- Abstract summary: ContrAstive Pruning (CAP) is designed as a general framework, compatible with both structured and unstructured pruning.
CAP consistently yields significant improvements, especially in extremely high sparsity scenarios.
- Score: 32.35855458528584
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained Language Models (PLMs) have achieved great success in various
Natural Language Processing (NLP) tasks under the pre-training and fine-tuning
paradigm. With large quantities of parameters, PLMs are computation-intensive
and resource-hungry. Hence, model pruning has been introduced to compress
large-scale PLMs. However, most prior approaches only consider task-specific
knowledge towards downstream tasks, but ignore the essential task-agnostic
knowledge during pruning, which may cause the catastrophic forgetting problem and
lead to poor generalization ability. To maintain both task-agnostic and
task-specific knowledge in our pruned model, we propose ContrAstive Pruning
(CAP) under the paradigm of pre-training and fine-tuning. It is designed as a
general framework, compatible with both structured and unstructured pruning.
Unified under contrastive learning, CAP enables the pruned model to learn from
the pre-trained model for task-agnostic knowledge and from the fine-tuned model
for task-specific knowledge. In addition, to better retain the performance of the
pruned model, the snapshots (i.e., the intermediate models at each pruning
iteration) also serve as effective supervision for pruning. Our extensive
experiments show that adopting CAP consistently yields significant
improvements, especially in extremely high-sparsity scenarios. With only 3% of
the model parameters retained (i.e., 97% sparsity), CAP achieves 99.2% and 96.3%
of the original BERT performance on the QQP and MNLI tasks, respectively. In addition,
our probing experiments demonstrate that the model pruned by CAP tends to
generalize better.
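As a concrete illustration of the abstract's description, the following PyTorch sketch shows one way a contrastive objective over sentence representations from the pruned model, the frozen pre-trained model, the fine-tuned model, and earlier pruning snapshots could be added to the ordinary task loss. The InfoNCE formulation, the model interfaces (each model returning a representation and logits), and the loss weights alpha/beta/gamma are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def info_nce(query, keys, temperature=0.1):
    """InfoNCE loss: each query's positive key is the key at the same
    batch index; all other keys in the batch act as negatives.
    query, keys: (batch, dim) sentence representations."""
    query = F.normalize(query, dim=-1)
    keys = F.normalize(keys, dim=-1)
    logits = query @ keys.t() / temperature                  # (batch, batch)
    targets = torch.arange(query.size(0), device=query.device)
    return F.cross_entropy(logits, targets)

def cap_step(batch, pruned_model, pretrained_model, finetuned_model,
             snapshots, task_loss_fn, alpha=0.1, beta=0.1, gamma=0.1):
    """One hypothetical CAP-style training step (illustrative weighting).

    pruned_model      -- the model currently being pruned (trainable)
    pretrained_model  -- frozen PLM, supplies task-agnostic knowledge
    finetuned_model   -- frozen fine-tuned model, task-specific knowledge
    snapshots         -- frozen intermediate models from earlier pruning
                         iterations
    Each *_model(batch) is assumed to return (sentence_repr, logits).
    """
    repr_pruned, logits = pruned_model(batch)
    loss = task_loss_fn(logits, batch["labels"])             # ordinary task loss

    with torch.no_grad():                                    # teachers are frozen
        repr_pre, _ = pretrained_model(batch)
        repr_ft, _ = finetuned_model(batch)
        repr_snaps = [snap(batch)[0] for snap in snapshots]

    # Pull the pruned model's representations toward all three sources.
    loss = loss + alpha * info_nce(repr_pruned, repr_pre)
    loss = loss + beta * info_nce(repr_pruned, repr_ft)
    for repr_snap in repr_snaps:
        loss = loss + (gamma / len(repr_snaps)) * info_nce(repr_pruned, repr_snap)
    return loss
```

Because the teacher models stay frozen, only the pruned model receives gradients, from both the task loss and the contrastive terms.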
Related papers
- SEAP: Training-free Sparse Expert Activation Pruning Unlock the Brainpower of Large Language Models [17.483183039447564]
This paper introduces Sparse Expert Activation Pruning (SEAP), a training-free pruning method that selectively retains task-relevant parameters to reduce inference overhead.
Experimental results demonstrate that SEAP significantly reduces computational overhead while maintaining competitive accuracy.
arXiv Detail & Related papers (2025-03-10T17:59:03Z)
- The First Few Tokens Are All You Need: An Efficient and Effective Unsupervised Prefix Fine-Tuning Method for Reasoning Models [69.798277882245]
We introduce Unsupervised Prefix Fine-Tuning (UPFT) to enhance large language models' reasoning efficiency.
UPFT removes the need for labeled data or exhaustive sampling.
Experiments show that UPFT matches the performance of supervised methods.
arXiv Detail & Related papers (2025-03-04T18:56:03Z)
- Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization [77.62516752323207]
We introduce an orthogonal fine-tuning method for efficiently fine-tuning pretrained weights and enabling enhanced robustness and generalization.
A self-regularization strategy is further employed to maintain the stability of the VLMs' zero-shot generalization; the method is dubbed OrthSR.
For the first time, we revisit CLIP and CoOp with our method to effectively improve the models in few-shot image classification scenarios.
arXiv Detail & Related papers (2024-07-11T10:35:53Z)
- Model Tailor: Mitigating Catastrophic Forgetting in Multi-modal Large Language Models [46.92994945808424]
Catastrophic forgetting emerges as a critical challenge when fine-tuning multi-modal large language models (MLLMs).
This paper presents a comprehensive analysis of catastrophic forgetting in MLLMs and introduces a post-training adjustment method called Model Tailor.
arXiv Detail & Related papers (2024-02-19T11:02:05Z)
- Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks [5.536630285985836]
We introduce parameter-efficient sparsity crafting (PESC).
PESC crafts dense models into sparse models using the mixture-of-experts (MoE) architecture.
Our best sparse model outperforms other sparse and dense models and exhibits superior general capabilities compared to GPT-3.5.
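The summary above only names the dense-to-MoE conversion. The sketch below is a generic "sparse upcycling" illustration in PyTorch, where each expert starts as a copy of the original dense feed-forward block and a learned router performs top-1 routing; it is an assumption-laden stand-in, not necessarily PESC's exact, parameter-efficient construction.

```python
import copy
import torch
import torch.nn as nn

class Top1MoE(nn.Module):
    """Generic upcycling of a dense FFN into a mixture of experts.
    Each expert is initialized as a copy of the dense block, and a learned
    router sends every token to its single highest-scoring expert."""

    def __init__(self, dense_ffn: nn.Module, hidden_dim: int, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(
            copy.deepcopy(dense_ffn) for _ in range(num_experts)
        )
        self.router = nn.Linear(hidden_dim, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, hidden_dim); flatten any batch dimensions beforehand.
        scores = self.router(x)                       # (tokens, num_experts)
        expert_idx = scores.argmax(dim=-1)            # top-1 routing
        gate = torch.softmax(scores, dim=-1).gather(
            -1, expert_idx.unsqueeze(-1))             # (tokens, 1)
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                out[mask] = expert(x[mask])
        return gate * out                             # weight by router prob

# Example: replace a dense 2-layer FFN with its MoE counterpart.
dense = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))
moe = Top1MoE(dense, hidden_dim=768, num_experts=4)
tokens = torch.randn(10, 768)
print(moe(tokens).shape)  # torch.Size([10, 768])
```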
arXiv Detail & Related papers (2024-01-05T09:58:09Z)
- An Emulator for Fine-Tuning Large Language Models using Small Language Models [91.02498576056057]
We introduce emulated fine-tuning (EFT), a principled and practical method for sampling from a distribution that approximates the result of pre-training and fine-tuning at different scales.
We show that EFT enables test-time adjustment of competing behavioral traits like helpfulness and harmlessness without additional training.
Finally, a special case of emulated fine-tuning, which we call LM up-scaling, avoids resource-intensive fine-tuning of large pre-trained models by ensembling them with small fine-tuned models.
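One way to read the "LM up-scaling" idea above, shown here only as a hedged sketch rather than the paper's exact procedure, is as logit arithmetic: add the fine-tuning shift of a small model's next-token log-probabilities to a large base model's log-probabilities and renormalize before sampling.

```python
import torch
import torch.nn.functional as F

def upscaled_next_token_logprobs(logits_large_base: torch.Tensor,
                                 logits_small_ft: torch.Tensor,
                                 logits_small_base: torch.Tensor) -> torch.Tensor:
    """Hypothetical LM up-scaling combination for one decoding step.

    Emulates "large pre-training + small fine-tuning" by adding the small
    model's fine-tuning shift (fine-tuned minus base log-probs) to the
    large base model's log-probs, then renormalizing.
    All inputs: (batch, vocab) next-token logits."""
    lp_large_base = F.log_softmax(logits_large_base, dim=-1)
    lp_small_ft = F.log_softmax(logits_small_ft, dim=-1)
    lp_small_base = F.log_softmax(logits_small_base, dim=-1)
    combined = lp_large_base + (lp_small_ft - lp_small_base)
    return F.log_softmax(combined, dim=-1)   # renormalize to a distribution

# Sampling then proceeds token by token from the combined distribution.
vocab = 32000
logits = [torch.randn(1, vocab) for _ in range(3)]
probs = upscaled_next_token_logprobs(*logits).exp()
next_token = torch.multinomial(probs, num_samples=1)
```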
arXiv Detail & Related papers (2023-10-19T17:57:16Z)
- Uncertainty-aware Parameter-Efficient Self-training for Semi-supervised Language Understanding [38.11411155621616]
We study self-training as one of the predominant semi-supervised learning approaches.
We present UPET, a novel Uncertainty-aware Parameter-Efficient self-Training framework.
We show that UPET achieves a substantial improvement in terms of performance and efficiency.
arXiv Detail & Related papers (2023-10-19T02:18:29Z)
- Making Pre-trained Language Models both Task-solvers and Self-calibrators [52.98858650625623]
Pre-trained language models (PLMs) serve as backbones for various real-world systems.
Previous work shows that introducing an extra calibration task can mitigate poorly calibrated confidence estimates.
We propose a training algorithm LM-TOAST to tackle the challenges.
arXiv Detail & Related papers (2023-07-21T02:51:41Z)
- Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning [81.3514358542452]
Few-shot in-context learning (ICL) incurs substantial computational, memory, and storage costs because it involves processing all of the training examples every time a prediction is made.
Parameter-efficient fine-tuning offers an alternative paradigm in which a small set of parameters is trained to enable a model to perform the new task.
In this paper, we rigorously compare few-shot ICL and parameter-efficient fine-tuning and demonstrate that the latter offers better accuracy as well as dramatically lower computational costs.
arXiv Detail & Related papers (2022-05-11T17:10:41Z)
- SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities [76.97949110580703]
We introduce SUPERB-SG, a new benchmark to evaluate pre-trained models across various speech tasks.
We use a lightweight methodology to test the robustness of representations learned by pre-trained models under shifts in data domain.
We also show that the task diversity of SUPERB-SG coupled with limited task supervision is an effective recipe for evaluating the generalizability of model representation.
arXiv Detail & Related papers (2022-03-14T04:26:40Z)
- DSEE: Dually Sparsity-embedded Efficient Tuning of Pre-trained Language Models [152.29364079385635]
As pre-trained models grow bigger, the fine-tuning process can be time-consuming and computationally expensive.
We propose a framework for resource- and parameter-efficient fine-tuning by leveraging the sparsity prior in both weight updates and the final model weights.
Our proposed framework, dubbed Dually Sparsity-Embedded Efficient Tuning (DSEE), aims to achieve two key objectives: (i) parameter efficient fine-tuning and (ii) resource-efficient inference.
arXiv Detail & Related papers (2021-10-30T03:29:47Z)
- On the Effect of Dropping Layers of Pre-trained Transformer Models [35.25025837133909]
We explore strategies to drop layers in pre-trained models, and observe the effect of pruning on downstream GLUE tasks.
We were able to prune BERT, RoBERTa and XLNet models up to 40%, while maintaining up to 98% of their original performance.
Our experiments yield interesting observations: (i) the lower layers are most critical for maintaining downstream task performance, (ii) some tasks, such as paraphrase detection and sentence similarity, are more robust to the dropping of layers, and (iii) models trained with different objective functions exhibit different learning patterns with respect to layer dropping.
arXiv Detail & Related papers (2020-04-08T07:09:59Z)
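A minimal sketch of the top-layer dropping strategy discussed in the entry above, assuming the Hugging Face transformers BERT implementation (attribute paths such as model.bert.encoder.layer are specific to that library, and only this one dropping variant is shown):

```python
import torch.nn as nn
from transformers import AutoModelForSequenceClassification

def drop_top_layers(model, keep: int):
    """Keep only the first `keep` encoder layers of a BERT-style model.

    Dropping the top layers (rather than the lower ones) follows the
    observation above that lower layers matter most for downstream tasks.
    """
    encoder = model.bert.encoder                       # BERT-specific path
    encoder.layer = nn.ModuleList(encoder.layer[:keep])
    model.config.num_hidden_layers = keep
    return model

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)
model = drop_top_layers(model, keep=8)   # 4 of 12 encoder layers removed
# Fine-tune the truncated model on the downstream task as usual.
```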