Gradient-based Intra-attention Pruning on Pre-trained Language Models
- URL: http://arxiv.org/abs/2212.07634v2
- Date: Thu, 18 May 2023 14:41:38 GMT
- Title: Gradient-based Intra-attention Pruning on Pre-trained Language Models
- Authors: Ziqing Yang, Yiming Cui, Xin Yao, Shijin Wang
- Abstract summary: We propose a structured pruning method, GRAIN (Gradient-based Intra-attention pruning).
GRAIN inspects and prunes intra-attention structures, which greatly expands the structure search space and enables more flexible models.
Experiments on GLUE, SQuAD, and CoNLL 2003 show that GRAIN notably outperforms other methods, especially in the high sparsity regime.
- Score: 21.444503777215637
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained language models achieve superior performance but are
computationally expensive. Techniques such as pruning and knowledge
distillation have been developed to reduce their sizes and latencies. In this
work, we propose a structured pruning method GRAIN (Gradient-based
Intra-attention pruning), which performs task-specific pruning with knowledge
distillation and yields highly effective models. Different from common
approaches that prune each attention head as a whole, GRAIN inspects and prunes
intra-attention structures, which greatly expands the structure search space
and enables more flexible models. We also propose a gradient separation
strategy that reduces the interference of distillation on pruning for a better
combination of the two approaches. Experiments on GLUE, SQuAD, and CoNLL 2003
show that GRAIN notably outperforms other methods, especially in the high
sparsity regime, and achieves $6\sim7\times$ speedups while maintaining
$93\%\sim99\%$ performance. Under extreme compression where only $3\%$
transformer weights remain, the pruned model is still competitive compared to
larger models.
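As a rough illustration of what gradient-based intra-attention pruning can look like (a minimal sketch, not the authors' implementation), the snippet below scores each per-head dimension of a query/key/value projection with a first-order |weight x gradient| criterion and masks the lowest-scoring dimensions. The function names, the scoring granularity, and the simple threshold rule are assumptions.

```python
import torch

def intra_attention_importance(weight: torch.Tensor, grad: torch.Tensor,
                               num_heads: int) -> torch.Tensor:
    """Score every per-head output dimension of a q/k/v projection matrix.

    weight, grad: (hidden_size, hidden_size) projection weight and its gradient.
    Returns (num_heads, head_dim) importance scores.
    """
    hidden_size = weight.shape[0]
    head_dim = hidden_size // num_heads
    # First-order importance |w * dL/dw|, summed over the input axis of each row.
    per_row = (weight * grad).abs().sum(dim=1)           # (hidden_size,)
    return per_row.view(num_heads, head_dim)

def keep_mask(scores: torch.Tensor, sparsity: float) -> torch.Tensor:
    """0/1 mask keeping the highest-scoring intra-attention dimensions."""
    k = max(int(scores.numel() * sparsity), 1)           # dimensions to drop
    threshold = scores.flatten().kthvalue(k).values
    return (scores > threshold).float()

if __name__ == "__main__":
    torch.manual_seed(0)
    hidden, heads = 768, 12
    w = torch.randn(hidden, hidden, requires_grad=True)
    loss = (w.sum(dim=1) ** 2).mean()                    # stand-in for a task loss
    loss.backward()
    mask = keep_mask(intra_attention_importance(w, w.grad, heads), sparsity=0.5)
    print(mask.shape, mask.mean().item())                # (12, 64), about half kept
```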
Related papers
- MoE-Pruner: Pruning Mixture-of-Experts Large Language Model using the Hints from Its Router [55.88046193872355]
Mixture-of-Experts (MoE) architectures face challenges such as high memory consumption and redundancy in experts.
We propose MoE-Pruner, a method that prunes weights with the smallest magnitudes multiplied by the corresponding input activations and router weights.
Our pruning method is one-shot, requiring no retraining or weight updates.
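Reading the stated criterion literally, a one-shot pruning pass could look like the sketch below, where each expert weight is scored by its magnitude times an input-activation norm times the router weight for that expert. The tensor shapes, the per-feature L2 norm, and the scalar router weight are assumptions rather than details taken from the paper.

```python
import torch

def moe_pruning_mask(expert_weight: torch.Tensor,   # (out_dim, in_dim)
                     input_acts: torch.Tensor,      # (num_tokens, in_dim)
                     router_weight: float,          # gate weight for this expert
                     sparsity: float) -> torch.Tensor:
    # Per-input-feature activation norm, as in activation-aware pruning criteria.
    act_norm = input_acts.norm(p=2, dim=0)           # (in_dim,)
    score = expert_weight.abs() * act_norm * router_weight
    k = max(int(score.numel() * sparsity), 1)
    threshold = score.flatten().kthvalue(k).values
    return (score > threshold).float()

if __name__ == "__main__":
    torch.manual_seed(0)
    w, x = torch.randn(256, 512), torch.randn(1024, 512)
    mask = moe_pruning_mask(w, x, router_weight=0.13, sparsity=0.5)
    pruned_w = w * mask                              # one-shot, no retraining
    print(f"kept {mask.mean().item():.2%} of the expert's weights")
```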
arXiv Detail & Related papers (2024-10-15T19:22:27Z)
- Effective Layer Pruning Through Similarity Metric Perspective [0.0]
Deep neural networks have been the predominant paradigm in machine learning for solving cognitive tasks.
Pruning structures from these models is a straightforward approach to reducing network complexity.
Layer pruning often hurts the network's predictive ability (i.e., accuracy) at high compression rates.
This work introduces an effective layer-pruning strategy that meets all underlying properties pursued by pruning methods.
arXiv Detail & Related papers (2024-05-27T11:54:51Z)
- LaCo: Large Language Model Pruning via Layer Collapse [56.92068213969036]
Large language models (LLMs) based on the Transformer architecture are showing a notable trend of size expansion.
Existing methods such as model quantization, knowledge distillation, and model pruning are constrained by various issues.
We propose a concise layer-wise structured pruner called Layer Collapse (LaCo), in which rear model layers collapse into a prior layer.
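One plausible reading of "rear layers collapsing into a prior layer" is a parameter-difference merge, sketched below: layers l+1..l+m are folded into layer l by adding their parameter differences relative to it, and the folded layers are then dropped. This is a hedged approximation, not necessarily the paper's exact merging rule.

```python
import copy
import torch
import torch.nn as nn

def collapse_layers(layers: nn.ModuleList, start: int, num_merged: int) -> nn.ModuleList:
    """Fold layers[start+1 .. start+num_merged] into layers[start] and drop them."""
    merged = copy.deepcopy(layers[start])
    with torch.no_grad():
        for k in range(1, num_merged + 1):
            for p_m, p_base, p_k in zip(merged.parameters(),
                                        layers[start].parameters(),
                                        layers[start + k].parameters()):
                p_m.add_(p_k - p_base)       # accumulate each rear layer's difference
    kept = [merged if i == start else layer
            for i, layer in enumerate(layers)
            if not (start < i <= start + num_merged)]
    return nn.ModuleList(kept)

if __name__ == "__main__":
    stack = nn.ModuleList([nn.Linear(16, 16) for _ in range(6)])
    shallower = collapse_layers(stack, start=2, num_merged=2)   # 6 layers -> 4
    print(len(shallower))
```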
arXiv Detail & Related papers (2024-02-17T04:16:30Z)
- Progressive Gradient Flow for Robust N:M Sparsity Training in Transformers [15.27677493050638]
N:M structured sparsity has garnered significant interest as a result of its relatively modest overhead and improved efficiency.
Although there have been efforts to develop training recipes for N:M structured sparsity, they primarily focus on low-sparsity regions.
However, the performance of models trained with these approaches tends to decline in high-sparsity regions.
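For reference, N:M structured sparsity itself (independent of the training recipe proposed above) keeps only the N largest-magnitude weights in every group of M consecutive weights along the input dimension, e.g. 2:4. A minimal sketch:

```python
import torch

def n_m_mask(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    out_dim, in_dim = weight.shape
    assert in_dim % m == 0, "input dimension must be divisible by M"
    groups = weight.abs().view(out_dim, in_dim // m, m)
    # Indices of the N largest-magnitude weights inside each group of M.
    topk = groups.topk(n, dim=-1).indices
    mask = torch.zeros_like(groups)
    mask.scatter_(-1, topk, 1.0)
    return mask.view(out_dim, in_dim)

if __name__ == "__main__":
    w = torch.randn(8, 16)
    mask = n_m_mask(w, n=2, m=4)       # exactly 2 of every 4 weights kept
    print((w * mask).count_nonzero().item(), "of", w.numel(), "weights kept")
```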
arXiv Detail & Related papers (2024-02-07T10:55:59Z)
- Structured Pruning Learns Compact and Accurate Models [28.54826400747667]
We propose CoFi (Coarse- and Fine-grained Pruning), a task-specific structured pruning method.
CoFi delivers highly parallelizable subnetworks and matches distillation methods in both accuracy and latency.
Our experiments on GLUE and SQuAD datasets show that CoFi yields models with over 10x speedups with a small accuracy drop.
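In the spirit of pruning at both granularities, the toy block below applies a per-head mask (coarse-grained) and a per-dimension mask on the FFN intermediate layer (fine-grained). In CoFi such masks are learned jointly with a distillation objective; here they are plain parameters, and all layer sizes are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

class MaskedBlock(nn.Module):
    def __init__(self, d_model: int = 256, num_heads: int = 4, d_ff: int = 1024):
        super().__init__()
        self.h, self.dh = num_heads, d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        self.ff1, self.ff2 = nn.Linear(d_model, d_ff), nn.Linear(d_ff, d_model)
        self.head_mask = nn.Parameter(torch.ones(num_heads))   # coarse: whole heads
        self.ffn_mask = nn.Parameter(torch.ones(d_ff))         # fine: single dims

    def forward(self, x: torch.Tensor) -> torch.Tensor:        # x: (batch, seq, d)
        b, s, d = x.shape
        q, k, v = self.qkv(x).view(b, s, 3, self.h, self.dh).unbind(dim=2)
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))        # (b, h, s, dh)
        att = (q @ k.transpose(-1, -2) / self.dh ** 0.5).softmax(dim=-1)
        ctx = (att @ v) * self.head_mask[None, :, None, None]   # zero masked heads
        x = x + self.out(ctx.transpose(1, 2).reshape(b, s, d))
        hid = torch.relu(self.ff1(x)) * self.ffn_mask           # zero masked FFN dims
        return x + self.ff2(hid)

if __name__ == "__main__":
    block = MaskedBlock()
    print(block(torch.randn(2, 8, 256)).shape)                  # torch.Size([2, 8, 256])
```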
arXiv Detail & Related papers (2022-04-01T13:09:56Z)
- Sparse Progressive Distillation: Resolving Overfitting under Pretrain-and-Finetune Paradigm [7.662952656290564]
Various pruning approaches have been proposed to reduce the footprint requirements of Transformer-based language models.
We show for the first time that reducing the risk of overfitting can improve the effectiveness of pruning under the pretrain-and-finetune paradigm.
arXiv Detail & Related papers (2021-10-15T16:42:56Z)
- Powerpropagation: A sparsity inducing weight reparameterisation [65.85142037667065]
We introduce Powerpropagation, a new weight reparameterisation for neural networks that leads to inherently sparse models.
Models trained in this manner exhibit similar performance, but their weight distribution has markedly higher density at zero, allowing more parameters to be pruned safely.
Here, we combine Powerpropagation with a traditional weight-pruning technique as well as recent state-of-the-art sparse-to-sparse algorithms, showing superior performance on the ImageNet benchmark.
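The reparameterisation can be sketched as below, where the trainable parameter phi enters the forward pass as w = phi * |phi|^(alpha - 1), so gradient updates are implicitly scaled by the current magnitude and small weights drift toward zero. The layer sizes, initialisation, and alpha value are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PowerpropLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, alpha: float = 2.0):
        super().__init__()
        self.alpha = alpha
        self.phi = nn.Parameter(torch.empty(out_features, in_features))
        self.bias = nn.Parameter(torch.zeros(out_features))
        nn.init.kaiming_uniform_(self.phi, a=5 ** 0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Effective weight used in the forward pass: w = phi * |phi|^(alpha - 1).
        weight = self.phi * self.phi.abs().pow(self.alpha - 1.0)
        return F.linear(x, weight, self.bias)

if __name__ == "__main__":
    layer = PowerpropLinear(32, 8, alpha=2.0)
    x = torch.randn(4, 32)
    loss = layer(x).pow(2).mean()
    loss.backward()                      # gradients flow to phi, not to weight
    print(layer.phi.grad.abs().mean().item())
```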
arXiv Detail & Related papers (2021-10-01T10:03:57Z)
- Block Pruning For Faster Transformers [89.70392810063247]
We introduce a block pruning approach targeting both small and fast models.
We find that this approach learns to prune out full components of the underlying model, such as attention heads.
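A hedged sketch of block-level pruning on a weight matrix: the matrix is tiled into fixed-size blocks and the lowest-scoring blocks are zeroed, so whole structures (for example the slices belonging to one attention head) can disappear entirely. The paper learns the block scores during fine-tuning; plain magnitude scores are used here only to keep the sketch self-contained.

```python
import torch

def block_prune(weight: torch.Tensor, block: int = 32, sparsity: float = 0.5) -> torch.Tensor:
    out_dim, in_dim = weight.shape
    assert out_dim % block == 0 and in_dim % block == 0
    blocks = weight.reshape(out_dim // block, block, in_dim // block, block)
    score = blocks.abs().mean(dim=(1, 3))                 # one score per block
    k = max(int(score.numel() * sparsity), 1)
    thresh = score.flatten().kthvalue(k).values
    mask = (score > thresh).float()[:, None, :, None]     # broadcast over block tiles
    return (blocks * mask).reshape(out_dim, in_dim)

if __name__ == "__main__":
    w = torch.randn(768, 768)
    pruned = block_prune(w, block=64, sparsity=0.75)
    print(f"{(pruned == 0).float().mean().item():.2%} of weights zeroed")
```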
arXiv Detail & Related papers (2021-09-10T12:46:32Z)
- One-Cycle Pruning: Pruning ConvNets Under a Tight Training Budget [0.0]
Introducing sparsity in a neural network has been an efficient way to reduce its complexity while keeping its performance almost intact.
Most of the time, sparsity is introduced using a three-stage pipeline: 1) train the model to convergence, 2) prune the model according to some criterion, 3) fine-tune the pruned model to recover performance.
In our work, we propose to get rid of the first step of the pipeline and to combine the two other steps in a single pruning-training cycle.
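A minimal sketch of such a single pruning-training cycle: the sparsity target ramps up over the one and only training run, and a magnitude mask at the current level is re-applied after every optimizer step. The cubic schedule and the toy model are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

def current_sparsity(step: int, total_steps: int, final_sparsity: float) -> float:
    # Simple cubic ramp from 0 to the final sparsity over training.
    t = min(step / total_steps, 1.0)
    return final_sparsity * (1 - (1 - t) ** 3)

model = nn.Linear(64, 10)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
total_steps, final_sparsity = 200, 0.8

for step in range(total_steps):
    x, y = torch.randn(32, 64), torch.randint(0, 10, (32,))
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    # Re-apply a magnitude mask at the sparsity level of the current step.
    s = current_sparsity(step, total_steps, final_sparsity)
    with torch.no_grad():
        w = model.weight
        k = int(w.numel() * s)
        if k > 0:
            thresh = w.abs().flatten().kthvalue(k).values
            w.mul_((w.abs() > thresh).float())

print(f"final sparsity: {(model.weight == 0).float().mean().item():.2%}")
```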
arXiv Detail & Related papers (2021-07-05T15:27:07Z)
- Exploring Sparse Expert Models and Beyond [51.90860155810848]
Mixture-of-Experts (MoE) models can achieve promising results with an outrageously large number of parameters but constant computational cost.
We propose a simple method called expert prototyping that splits experts into different prototypes and applies $k$ top-$1$ routing.
This strategy improves model quality while maintaining constant computational cost, and our further exploration of extremely large-scale models shows that it is more effective for training larger models.
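A minimal sketch of the routing pattern described above: experts are split into k prototypes, each prototype routes every token to its own top-1 expert, and the k selected outputs are summed, so compute stays roughly constant as the total number of experts grows. The gating form and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class PrototypedMoE(nn.Module):
    def __init__(self, d_model: int, experts_per_proto: int, num_protos: int):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.ModuleList([nn.Linear(d_model, d_model)
                            for _ in range(experts_per_proto)])
             for _ in range(num_protos)])
        self.gates = nn.ModuleList(
            [nn.Linear(d_model, experts_per_proto) for _ in range(num_protos)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (tokens, d_model)
        out = torch.zeros_like(x)
        for proto, gate in zip(self.experts, self.gates):
            probs = gate(x).softmax(dim=-1)                # (tokens, experts)
            top1 = probs.argmax(dim=-1)                    # top-1 routing per prototype
            for e, expert in enumerate(proto):
                sel = top1 == e
                if sel.any():
                    out[sel] += probs[sel, e].unsqueeze(-1) * expert(x[sel])
        return out

if __name__ == "__main__":
    moe = PrototypedMoE(d_model=64, experts_per_proto=4, num_protos=2)
    y = moe(torch.randn(16, 64))
    print(y.shape)
```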
arXiv Detail & Related papers (2021-05-31T16:12:44Z)
- Movement Pruning: Adaptive Sparsity by Fine-Tuning [115.91907953454034]
Magnitude pruning is a widely used strategy for reducing model size in pure supervised learning.
We propose the use of movement pruning, a simple, deterministic first-order weight pruning method.
Experiments show that when pruning large pretrained language models, movement pruning shows significant improvements in high-sparsity regimes.
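The first-order idea can be sketched as below: a score is accumulated from -weight x gradient during fine-tuning, so weights moving away from zero gain importance while weights shrinking toward zero lose it, and the lowest-scoring weights are masked at the end. The paper learns its scores with a straight-through estimator, so this accumulation variant is a simplification.

```python
import torch
import torch.nn as nn

model = nn.Linear(64, 10)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
scores = torch.zeros_like(model.weight)

for _ in range(100):
    x, y = torch.randn(32, 64), torch.randint(0, 10, (32,))
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    # Movement score: negative first-order term, accumulated over training steps.
    scores -= model.weight.grad.detach() * model.weight.detach()
    opt.step()

sparsity = 0.9
k = int(scores.numel() * sparsity)
thresh = scores.flatten().kthvalue(k).values
with torch.no_grad():
    model.weight.mul_((scores > thresh).float())
print(f"{(model.weight == 0).float().mean().item():.2%} of weights pruned")
```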
arXiv Detail & Related papers (2020-05-15T17:54:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.