Pruning Pre-trained Language Models Without Fine-Tuning
- URL: http://arxiv.org/abs/2210.06210v2
- Date: Tue, 16 May 2023 06:24:43 GMT
- Title: Pruning Pre-trained Language Models Without Fine-Tuning
- Authors: Ting Jiang, Deqing Wang, Fuzhen Zhuang, Ruobing Xie, Feng Xia
- Abstract summary: We argue that fine-tuning is redundant for first-order pruning, since first-order pruning alone is sufficient to adapt PLMs to downstream tasks.
Under this motivation, we propose Static Model Pruning (SMP), which only uses first-order pruning to adapt PLMs to downstream tasks while achieving the target sparsity level.
- Score: 42.54071630668426
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To overcome the overparameterization problem in Pre-trained Language Models
(PLMs), pruning is widely used as a simple and straightforward compression
method by directly removing unimportant weights. Previous first-order methods
successfully compress PLMs to extremely high sparsity with little performance
drop. These methods, such as movement pruning, use first-order information to
prune PLMs while fine-tuning the remaining weights. In this work, we argue that
fine-tuning is redundant for first-order pruning, since first-order pruning
alone is sufficient to adapt PLMs to downstream tasks. Under this
motivation, we propose Static Model Pruning (SMP), which only uses first-order
pruning to adapt PLMs to downstream tasks while achieving the target sparsity
level. In addition, we design a new masking function and training
objective to further improve SMP. Extensive experiments at various sparsity
levels show SMP has significant improvements over first-order and zero-order
methods. Unlike previous first-order methods, SMP is also applicable at low
sparsity, where it outperforms zero-order methods. Meanwhile, SMP is more
parameter-efficient than other methods because it does not require fine-tuning.
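The mechanism described in the abstract can be illustrated with a minimal sketch: keep the pre-trained weights frozen and learn only per-weight importance scores, binarized into a top-k mask with a straight-through estimator so the scores receive first-order updates. The following PyTorch sketch is illustrative only, not the authors' released implementation; the names (`TopKBinarizer`, `MaskedLinear`, `keep_ratio`) and the score initialization are assumptions made for the example.

```python
import torch
import torch.nn as nn


class TopKBinarizer(torch.autograd.Function):
    """Keep the top-k fraction of scores as a 0/1 mask; pass gradients
    straight through so the scores receive first-order updates."""

    @staticmethod
    def forward(ctx, scores: torch.Tensor, keep_ratio: float) -> torch.Tensor:
        k = max(1, int(keep_ratio * scores.numel()))
        threshold = torch.topk(scores.flatten(), k).values[-1]
        return (scores >= threshold).to(scores.dtype)

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: the gradient flows to the scores unchanged.
        return grad_output, None


class MaskedLinear(nn.Module):
    """Linear layer whose pre-trained weight stays frozen; only the
    importance scores (and the mask derived from them) are learned."""

    def __init__(self, linear: nn.Linear, keep_ratio: float = 0.1):
        super().__init__()
        self.weight = nn.Parameter(linear.weight.detach().clone(), requires_grad=False)
        self.bias = (
            nn.Parameter(linear.bias.detach().clone(), requires_grad=False)
            if linear.bias is not None else None
        )
        # Illustrative initialization; the paper's exact scheme may differ.
        self.scores = nn.Parameter(0.01 * torch.randn_like(self.weight))
        self.keep_ratio = keep_ratio

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mask = TopKBinarizer.apply(self.scores, self.keep_ratio)
        return nn.functional.linear(x, self.weight * mask, self.bias)


if __name__ == "__main__":
    layer = MaskedLinear(nn.Linear(768, 768), keep_ratio=0.1)
    x = torch.randn(4, 768)
    loss = layer(x).pow(2).mean()
    loss.backward()
    # Only the scores carry gradients; the frozen weight does not.
    print(layer.scores.grad is not None, layer.weight.grad)
```

Under this setup only the score tensors would be optimized during adaptation; the frozen weights never change, which is where the parameter-efficiency claim above comes from.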
Related papers
- Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning method for Large Language Models.
We learn the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model.
Our method runs in 2.7 hours with around 35GB of memory for 13B models on a single A100 GPU.
arXiv Detail & Related papers (2024-06-15T09:31:03Z)
- SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models [53.638791265113625]
SPP is a sparsity-preserved, parameter-efficient fine-tuning method for large language models.
Code will be made available at https://github.com/Lucky-Lance/SPP.
arXiv Detail & Related papers (2024-05-25T04:55:27Z)
- Just CHOP: Embarrassingly Simple LLM Compression [27.64461490974072]
Large language models (LLMs) enable unparalleled few- and zero-shot reasoning capabilities but at a high computational footprint.
We show that simple layer pruning coupled with an extended language model pretraining produces state-of-the-art results against structured and even semi-structured compression of models at a 7B scale.
We also show how distillation, which has been highly effective for task-agnostic compression of smaller BERT-style models, becomes inefficient compared with our simple pruning technique.
arXiv Detail & Related papers (2023-05-24T08:18:35Z)
- NoisyTune: A Little Noise Can Help You Finetune Pretrained Language Models Better [98.5705258907774]
Finetuning pretrained language models (PLMs) is critical for their success in downstream tasks.
PLMs risk overfitting pretraining signals, and there are gaps between downstream tasks and the pretraining tasks.
NoisyTune helps finetune PLMs better on downstream tasks by adding some noise to the PLM parameters before finetuning.
arXiv Detail & Related papers (2022-02-24T11:08:02Z)
- Know What You Don't Need: Single-Shot Meta-Pruning for Attention Heads [114.77890059625162]
We propose a method, called Single-Shot Meta-Pruning, to compress deep pre-trained Transformers before fine-tuning.
We focus on pruning unnecessary attention heads adaptively for different downstream tasks.
Compared with existing compression methods for pre-trained models, our method can reduce the overhead of both fine-tuning and inference.
arXiv Detail & Related papers (2020-11-07T12:58:37Z)
- Movement Pruning: Adaptive Sparsity by Fine-Tuning [115.91907953454034]
Magnitude pruning is a widely used strategy for reducing model size in pure supervised learning.
We propose the use of movement pruning, a simple, deterministic first-order weight pruning method.
Experiments show that when pruning large pretrained language models, movement pruning yields significant improvements in high-sparsity regimes; its first-order score update is sketched after this list.
arXiv Detail & Related papers (2020-05-15T17:54:15Z)
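For contrast with the zero-order magnitude criterion, the first-order criterion that movement pruning relies on can be summarized in a few lines. This is a hedged sketch of the published score update (under a straight-through top-k mask, the gradient of the loss with respect to a score is (∂L/∂W)·W), using hypothetical tensors rather than a real training loop; the function name and learning rate are illustrative.

```python
import torch

def movement_score_update(weight: torch.Tensor,
                          weight_grad: torch.Tensor,
                          score: torch.Tensor,
                          lr: float = 1e-2) -> torch.Tensor:
    """One SGD-style update of movement-pruning importance scores.

    The accumulated score is proportional to -sum_t (dL/dW_ij) * W_ij:
    weights that keep moving away from zero during training receive
    high scores and are retained.
    """
    return score - lr * weight_grad * weight

# Toy illustration with stand-in tensors (no real backward pass).
w = torch.randn(4, 4)
g = torch.randn(4, 4)       # stand-in for dL/dW from a backward pass
s = torch.zeros(4, 4)
s = movement_score_update(w, g, s)
keep = s >= s.flatten().topk(8).values[-1]   # keep the top 50% of weights
print(keep.float().mean())                   # ~0.5
```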
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences.