Improved Methods for Model Pruning and Knowledge Distillation
- URL: http://arxiv.org/abs/2505.14052v1
- Date: Tue, 20 May 2025 07:53:40 GMT
- Title: Improved Methods for Model Pruning and Knowledge Distillation
- Authors: Wei Jiang, Anying Fu, Youling Zhang,
- Abstract summary: MAMA Pruning is a performance optimization technique for large language models like R1 or o3-mini.<n>It effectively reduces model size and computational complexity while maintaining performance comparable to the original unpruned model even at extreme pruned levels.<n>Preliminary experimental results show that our method outperforms and be comparable to state-of-the-art methods across various pruning levels and different downstream computational linguistics tasks.
- Score: 3.8993503758122663
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Model pruning is a performance optimization technique for large language models like R1 or o3-mini. However, existing pruning methods often lead to significant performance degradation or require extensive retraining and fine-tuning. This technique aims to identify and remove neurons, connections unlikely leading to the contribution during the human-computer interaction phase. Our goal is to obtain a much smaller and faster knowledge distilled model that can quickly generate content almost as good as those of the unpruned ones. We propose MAMA Pruning, short for Movement and Magnitude Analysis, an improved pruning method that effectively reduces model size and computational complexity while maintaining performance comparable to the original unpruned model even at extreme pruned levels. The improved method is based on weights, bias fixed in the pre-training phase and GRPO rewards verified during the post-training phase as our novel pruning indicators. Preliminary experimental results show that our method outperforms and be comparable to state-of-the-art methods across various pruning levels and different downstream computational linguistics tasks.
Related papers
- IDEA Prune: An Integrated Enlarge-and-Prune Pipeline in Generative Language Model Pretraining [50.53912352342753]
We propose an integrated enlarge-and-prune pipeline, which combines enlarge model training, pruning, and recovery.<n>We conduct experiments on compressing 2.8B models to 1.3B with up to 2T tokens in pretraining.<n>It demonstrates the integrated approach not only provides insights into the token efficiency of enlarged model pretraining but also achieves superior performance of pruned models.
arXiv Detail & Related papers (2025-03-07T20:35:31Z) - DReSS: Data-driven Regularized Structured Streamlining for Large Language Models [30.47317140878219]
Large language models (LLMs) have achieved significant progress across various domains, but their increasing scale results in high computational and memory costs.<n>We propose a novel paradigm that first applies regularization, then prunes, and finally finetunes.<n>By leveraging a small amount of data to regularize the components to be pruned, DReSS explicitly transfers the important information to the remaining parts of the model in advance.
arXiv Detail & Related papers (2025-01-29T14:28:11Z) - Numerical Pruning for Efficient Autoregressive Models [87.56342118369123]
This paper focuses on compressing decoder-only transformer-based autoregressive models through structural weight pruning.<n>Specifically, we propose a training-free pruning method that calculates a numerical score with Newton's method for the Attention and modules, respectively.<n>To verify the effectiveness of our method, we provide both theoretical support and extensive experiments.
arXiv Detail & Related papers (2024-12-17T01:09:23Z) - Effective Layer Pruning Through Similarity Metric Perspective [0.0]
Deep neural networks have been the predominant paradigm in machine learning for solving cognitive tasks.
Pruning structures from these models is a straightforward approach to reducing network complexity.
Layer pruning often hurts the network predictive ability (i.e., accuracy) at high compression rates.
This work introduces an effective layer-pruning strategy that meets all underlying properties pursued by pruning methods.
arXiv Detail & Related papers (2024-05-27T11:54:51Z) - LaCo: Large Language Model Pruning via Layer Collapse [56.92068213969036]
Large language models (LLMs) based on transformer are witnessing a notable trend of size expansion.
Existing methods such as model quantization, knowledge distillation, and model pruning are constrained by various issues.
We propose a concise layer-wise structured pruner called textitLayer Collapse (LaCo), in which rear model layers collapse into a prior layer.
arXiv Detail & Related papers (2024-02-17T04:16:30Z) - Sparse Progressive Distillation: Resolving Overfitting under
Pretrain-and-Finetune Paradigm [7.662952656290564]
Various pruning approaches have been proposed to reduce the footprint requirements of Transformer-based language models.
We show for the first time that reducing the risk of overfitting can help the effectiveness of pruning under the pretrain-and-finetune paradigm.
arXiv Detail & Related papers (2021-10-15T16:42:56Z) - Block Pruning For Faster Transformers [89.70392810063247]
We introduce a block pruning approach targeting both small and fast models.
We find that this approach learns to prune out full components of the underlying model, such as attention heads.
arXiv Detail & Related papers (2021-09-10T12:46:32Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z) - BERT Loses Patience: Fast and Robust Inference with Early Exit [91.26199404912019]
We propose Patience-based Early Exit as a plug-and-play technique to improve the efficiency and robustness of a pretrained language model.
Our approach improves inference efficiency as it allows the model to make a prediction with fewer layers.
arXiv Detail & Related papers (2020-06-07T13:38:32Z) - Movement Pruning: Adaptive Sparsity by Fine-Tuning [115.91907953454034]
Magnitude pruning is a widely used strategy for reducing model size in pure supervised learning.
We propose the use of movement pruning, a simple, deterministic first-order weight pruning method.
Experiments show that when pruning large pretrained language models, movement pruning shows significant improvements in high-sparsity regimes.
arXiv Detail & Related papers (2020-05-15T17:54:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.