Know What You Don't Need: Single-Shot Meta-Pruning for Attention Heads
- URL: http://arxiv.org/abs/2011.03770v1
- Date: Sat, 7 Nov 2020 12:58:37 GMT
- Title: Know What You Don't Need: Single-Shot Meta-Pruning for Attention Heads
- Authors: Zhengyan Zhang, Fanchao Qi, Zhiyuan Liu, Qun Liu, Maosong Sun
- Abstract summary: We propose a method, called Single-Shot Meta-Pruning, to compress deep pre-trained Transformers before fine-tuning.
We focus on pruning unnecessary attention heads adaptively for different downstream tasks.
Compared with existing compression methods for pre-trained models, our method can reduce the overhead of both fine-tuning and inference.
- Score: 114.77890059625162
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep pre-trained Transformer models have achieved state-of-the-art results
over a variety of natural language processing (NLP) tasks. By learning rich
language knowledge with millions of parameters, these models are usually
overparameterized and significantly increase the computational overhead in
applications. It is intuitive to address this issue by model compression. In
this work, we propose a method, called Single-Shot Meta-Pruning, to compress
deep pre-trained Transformers before fine-tuning. Specifically, we focus on
pruning unnecessary attention heads adaptively for different downstream tasks.
To measure the informativeness of attention heads, we train our Single-Shot
Meta-Pruner (SMP) with a meta-learning paradigm aiming to maintain the
distribution of text representations after pruning. Compared with existing
compression methods for pre-trained models, our method can reduce the overhead
of both fine-tuning and inference. Experimental results show that our pruner
can selectively prune 50% of attention heads with little impact on the
performance on downstream tasks and even provide better text representations.
The source code will be released in the future.
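As a rough illustration of the pruning step described in the abstract, the sketch below masks the lowest-scoring half of the attention heads given per-head informativeness scores and measures how far pruned representations drift from the full model's. It is a minimal PyTorch sketch under assumed shapes and a placeholder scoring tensor, not the authors' Single-Shot Meta-Pruner implementation or its meta-training loop.
```python
# Minimal sketch (PyTorch): mask the least informative attention heads and
# check how much the pooled representations drift. The scores, shapes, and
# MSE objective here are illustrative assumptions, not the released SMP code.
import torch
import torch.nn.functional as F

def head_mask_from_scores(head_scores: torch.Tensor, prune_ratio: float = 0.5) -> torch.Tensor:
    """head_scores: (num_layers, num_heads) informativeness scores.
    Returns a 0/1 mask keeping the highest-scoring heads."""
    flat = head_scores.flatten()
    k = int(flat.numel() * prune_ratio)        # number of heads to drop
    threshold = flat.kthvalue(k).values        # k-th smallest score
    return (head_scores > threshold).float()   # 1 = keep, 0 = prune

def apply_head_mask(per_head_output: torch.Tensor, layer_mask: torch.Tensor) -> torch.Tensor:
    """per_head_output: (batch, heads, seq, head_dim); layer_mask: (heads,)."""
    return per_head_output * layer_mask.view(1, -1, 1, 1)

def representation_shift(full_repr: torch.Tensor, pruned_repr: torch.Tensor) -> torch.Tensor:
    """Stand-in for the meta-objective of keeping the distribution of text
    representations close after pruning (the exact loss is an assumption)."""
    return F.mse_loss(pruned_repr, full_repr)

# Toy usage: 12 layers x 12 heads, prune 50% of heads as in the experiments.
scores = torch.rand(12, 12)
mask = head_mask_from_scores(scores, prune_ratio=0.5)
print(int(mask.sum().item()), "of", mask.numel(), "heads kept")
```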
Related papers
- A Survey on Transformer Compression [84.18094368700379]
Transformers play a vital role in the realms of natural language processing (NLP) and computer vision (CV).
Model compression methods reduce the memory and computational cost of Transformer models.
This survey provides a comprehensive review of recent compression methods, with a specific focus on their application to Transformer-based models.
arXiv Detail & Related papers (2024-02-05T12:16:28Z)
- On the Effectiveness of LayerNorm Tuning for Continual Learning in Vision Transformers [47.77328392236625]
State-of-the-art rehearsal-free continual learning methods exploit the peculiarities of Vision Transformers to learn task-specific prompts.
We introduce a two-stage training procedure: we first optimize the task-specific parameters and then train the classifier with the same selection procedure used at inference time.
Our method achieves results that are either superior or on par with the state of the art while being computationally cheaper.
arXiv Detail & Related papers (2023-08-18T15:11:16Z)
- Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers [29.319666323947708]
We present a novel approach that dynamically prunes contextual information while preserving the model's expressiveness.
Our method employs a learnable mechanism that determines which uninformative tokens can be dropped from the context.
Our reference implementation achieves up to a $2\times$ increase in inference throughput and even greater memory savings.
arXiv Detail & Related papers (2023-05-25T07:39:41Z)
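A rough picture of the token-dropping mechanism summarized in the Dynamic Context Pruning entry above: a learnable gate scores each context token, and low-scoring tokens are masked out of attention. The gate, the hard threshold, and the shapes are assumptions of this sketch, not the paper's reference implementation; training the decision end to end would require a differentiable relaxation of the threshold.
```python
# Minimal sketch (PyTorch): a learnable gate scores past tokens and drops the
# uninformative ones from the attention context. The gate, hard threshold, and
# shapes are illustrative assumptions, not the paper's reference implementation.
import torch
import torch.nn as nn

class ContextPruner(nn.Module):
    def __init__(self, hidden_dim: int, keep_threshold: float = 0.5):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, 1)   # learnable dropping mechanism
        self.keep_threshold = keep_threshold

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim) -> additive attention bias
        keep_prob = torch.sigmoid(self.gate(hidden_states)).squeeze(-1)
        keep_mask = keep_prob > self.keep_threshold
        # 0 where the token stays in the context, -inf where it is dropped.
        return torch.zeros_like(keep_prob).masked_fill(~keep_mask, float("-inf"))

# Toy usage: bias toy attention logits (batch, heads, query, key) with the mask.
pruner = ContextPruner(hidden_dim=64)
hidden = torch.randn(2, 10, 64)
bias = pruner(hidden)                          # (2, 10)
logits = torch.randn(2, 4, 10, 10) + bias[:, None, None, :]
# Dropped keys never enter attention, which is where the throughput and
# memory savings quoted above would come from.
```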
- PVP: Pre-trained Visual Parameter-Efficient Tuning [29.05396521860764]
Large-scale pre-trained transformers have demonstrated remarkable success in various computer vision tasks.
It is still highly challenging to fully fine-tune these models for downstream tasks due to their high computational and storage costs.
We propose a Pre-trained Visual Parameter-efficient (PVP) Tuning framework, which pre-trains the parameter-efficient tuning modules first and then leverages the pre-trained modules.
arXiv Detail & Related papers (2023-04-26T15:55:29Z)
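The two-phase idea in the PVP entry above (pre-train the parameter-efficient modules, then reuse them for downstream tuning) can be pictured as follows. The bottleneck-adapter module, the placeholder objective, and the frozen linear "backbone" are assumptions for illustration only, not the paper's implementation.
```python
# Minimal sketch (PyTorch): pre-train a parameter-efficient module first, then
# reuse its weights to initialize downstream tuning. The adapter design and the
# placeholder objective are assumptions, not the PVP paper's implementation.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Tiny residual bottleneck adapter; only these weights are trained."""
    def __init__(self, dim: int, bottleneck: int = 16):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

def pretrain_adapter(adapter, backbone, batches, lr=1e-3):
    """Phase 1: backbone frozen, adapter trained on a generic objective."""
    for p in backbone.parameters():
        p.requires_grad = False
    opt = torch.optim.AdamW(adapter.parameters(), lr=lr)
    for x, y in batches:
        loss = nn.functional.mse_loss(adapter(backbone(x)), y)   # placeholder objective
        opt.zero_grad(); loss.backward(); opt.step()
    return adapter.state_dict()

# Phase 2: start downstream parameter-efficient tuning from the pre-trained
# adapter weights instead of a random initialization.
backbone, adapter = nn.Linear(32, 32), Adapter(32)
batches = [(torch.randn(8, 32), torch.randn(8, 32)) for _ in range(3)]
pretrained = pretrain_adapter(adapter, backbone, batches)
downstream_adapter = Adapter(32)
downstream_adapter.load_state_dict(pretrained)
```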
- Voting from Nearest Tasks: Meta-Vote Pruning of Pre-trained Models for Downstream Tasks [55.431048995662714]
We create a small model for a new task from the pruned models of similar tasks.
We show that a few fine-tuning steps on this model suffice to produce a promising pruned model for the new task.
We develop a simple but effective Meta-Vote Pruning (MVP) method that significantly reduces the pruning iterations for a new task.
arXiv Detail & Related papers (2023-01-27T06:49:47Z)
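One natural reading of the Meta-Vote Pruning entry above is a vote over the pruning masks of similar tasks' pruned models, followed by a few fine-tuning steps on the new task. The voting rule and data layout below are assumptions of this sketch, not the MVP paper's exact procedure.
```python
# Minimal sketch (PyTorch): build an initial pruning mask for a new task by
# voting over masks from pruned models of similar tasks. The voting rule and
# layout are assumptions, not the MVP paper's exact procedure.
import torch

def meta_vote_mask(neighbor_masks: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """neighbor_masks: (num_similar_tasks, num_units) 0/1 masks.
    Keep the units most frequently retained by the similar tasks."""
    votes = neighbor_masks.float().mean(dim=0)        # fraction of tasks keeping each unit
    k = max(int(votes.numel() * keep_ratio), 1)
    mask = torch.zeros_like(votes)
    mask[torch.topk(votes, k).indices] = 1.0
    return mask

# Toy usage: three similar tasks, ten prunable units; a few fine-tuning steps
# on the new task's data would then refine the resulting small model.
neighbors = torch.randint(0, 2, (3, 10))
print(meta_vote_mask(neighbors, keep_ratio=0.5))
```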
- Pruning Pre-trained Language Models Without Fine-Tuning [42.54071630668426]
We argue that fine-tuning is redundant for first-order pruning, since first-order pruning alone is sufficient to adapt PLMs to downstream tasks.
Under this motivation, we propose Static Model Pruning (SMP), which uses only first-order pruning to adapt PLMs to downstream tasks while achieving the target sparsity level.
arXiv Detail & Related papers (2022-10-12T13:58:38Z)
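The entry above prunes with first-order information and never updates the weights. A generic first-order criterion (|weight × gradient| from a single backward pass) is sketched below; the criterion and the single-batch setup are assumptions, not necessarily the exact scores used by Static Model Pruning.
```python
# Minimal sketch (PyTorch): first-order pruning with no weight updates. Scores
# come from one backward pass (|weight x gradient|); the criterion and the
# single-batch setup are assumptions, not necessarily Static Model Pruning's.
import torch
import torch.nn as nn

def first_order_prune(model: nn.Module, loss: torch.Tensor, sparsity: float = 0.5):
    """Mask the lowest-scoring weights at the target sparsity; never fine-tune."""
    params = [p for _, p in model.named_parameters() if p.requires_grad]
    names = [n for n, p in model.named_parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    with torch.no_grad():
        for name, p, g in zip(names, params, grads):
            score = (p * g).abs().flatten()
            k = max(int(score.numel() * (1 - sparsity)), 1)   # weights to keep
            mask = torch.zeros_like(score)
            mask[torch.topk(score, k).indices] = 1.0
            p.mul_(mask.view_as(p))                           # prune in place, no gradient step

# Toy usage: a linear stand-in for a PLM head, one labelled batch, 50% sparsity.
model = nn.Linear(16, 2)
x, y = torch.randn(4, 16), torch.tensor([0, 1, 0, 1])
first_order_prune(model, nn.functional.cross_entropy(model(x), y), sparsity=0.5)
```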
- Self-Distillation for Further Pre-training of Transformers [83.84227016847096]
We propose self-distillation as a regularization for a further pre-training stage.
We empirically validate the efficacy of self-distillation on a variety of benchmark datasets for image and text classification tasks.
arXiv Detail & Related papers (2022-09-30T02:25:12Z)
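A compact way to picture the self-distillation regularizer from the entry above: keep a frozen copy of the current model and penalize the further-pre-trained model for drifting from it. The MSE distillation term, the stand-in objective, and the weight alpha are assumptions of this sketch, not the paper's setup.
```python
# Minimal sketch (PyTorch): further pre-training regularized by distillation
# from a frozen copy of the model itself. The MSE term, stand-in objective,
# and weight `alpha` are illustrative assumptions, not the paper's setup.
import copy
import torch
import torch.nn as nn

def further_pretrain_with_self_distillation(model, batches, alpha=0.1, lr=1e-4):
    teacher = copy.deepcopy(model).eval()          # snapshot of the current model
    for p in teacher.parameters():
        p.requires_grad = False
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for x, y in batches:
        student_out = model(x)
        task_loss = nn.functional.cross_entropy(student_out, y)   # stand-in objective
        with torch.no_grad():
            teacher_out = teacher(x)
        loss = task_loss + alpha * nn.functional.mse_loss(student_out, teacher_out)
        opt.zero_grad(); loss.backward(); opt.step()
    return model

# Toy usage with a linear classifier and random batches.
model = nn.Linear(16, 3)
batches = [(torch.randn(8, 16), torch.randint(0, 3, (8,))) for _ in range(2)]
further_pretrain_with_self_distillation(model, batches)
```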
- Dynamic Scale Training for Object Detection [111.33112051962514]
We propose a Dynamic Scale Training paradigm (abbreviated as DST) to mitigate the scale variation challenge in object detection.
Experimental results demonstrate the efficacy of our proposed DST towards scale variation handling.
It does not introduce inference overhead and could serve as a free lunch for general detection configurations.
arXiv Detail & Related papers (2020-04-26T16:48:17Z)
- Pre-training Text Representations as Meta Learning [113.3361289756749]
We introduce a learning algorithm that directly optimizes a model's ability to learn text representations for effective learning of downstream tasks.
We show that there is an intrinsic connection between multi-task pre-training and model-agnostic meta-learning with a sequence of meta-train steps.
arXiv Detail & Related papers (2020-04-12T09:05:47Z)
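The connection stated in the last entry (multi-task pre-training as a sequence of meta-train steps) can be illustrated with a first-order meta-train step: adapt a clone on a support batch, evaluate on a query batch, and push the resulting gradient back into the shared representation model. The first-order approximation, the losses, and the toy tasks are assumptions of this sketch.
```python
# Minimal sketch (PyTorch): one first-order meta-train step, in the spirit of
# viewing multi-task pre-training as a sequence of such steps. The first-order
# approximation, losses, and toy tasks are illustrative assumptions.
import copy
import torch
import torch.nn as nn

def meta_train_step(model, support, query, inner_lr=1e-2, outer_lr=1e-3):
    fast = copy.deepcopy(model)                     # task-adapted copy
    inner_opt = torch.optim.SGD(fast.parameters(), lr=inner_lr)
    xs, ys = support
    inner_loss = nn.functional.cross_entropy(fast(xs), ys)
    inner_opt.zero_grad(); inner_loss.backward(); inner_opt.step()

    xq, yq = query
    fast.zero_grad()
    outer_loss = nn.functional.cross_entropy(fast(xq), yq)
    outer_loss.backward()
    # First-order update: apply the adapted model's gradients to the original
    # representation model (ignoring second-order terms).
    with torch.no_grad():
        for p, fp in zip(model.parameters(), fast.parameters()):
            p -= outer_lr * fp.grad
    return outer_loss.item()

# Toy usage: each "task" is a random classification batch over shared features.
model = nn.Linear(32, 4)
make_task = lambda: (torch.randn(16, 32), torch.randint(0, 4, (16,)))
for _ in range(3):
    meta_train_step(model, support=make_task(), query=make_task())
```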