Prune&Comp: Free Lunch for Layer-Pruned LLMs via Iterative Pruning with Magnitude Compensation
- URL: http://arxiv.org/abs/2507.18212v1
- Date: Thu, 24 Jul 2025 09:07:20 GMT
- Title: Prune&Comp: Free Lunch for Layer-Pruned LLMs via Iterative Pruning with Magnitude Compensation
- Authors: Xinrui Chen, Hongxing Zhang, Fanyi Zeng, Yongxian Wei, Yizhi Wang, Xitong Ling, Guanghao Li, Chun Yuan
- Abstract summary: Layer pruning has emerged as a promising technique for compressing large language models (LLMs). In this work, we identify that removing any layer induces a significant magnitude gap in hidden states, resulting in substantial performance degradation. We propose Prune&Comp, a novel plug-and-play layer pruning scheme that mitigates such gaps in a training-free manner.
- Score: 27.807507187324987
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Layer pruning has emerged as a promising technique for compressing large language models (LLMs) while achieving acceleration proportional to the pruning ratio. In this work, we identify that removing any layer induces a significant magnitude gap in hidden states, resulting in substantial performance degradation. To address this issue, we propose Prune&Comp, a novel plug-and-play layer pruning scheme that leverages magnitude compensation to mitigate such gaps in a training-free manner. Specifically, we first estimate the magnitude gap caused by layer removal and then eliminate this gap by rescaling the remaining weights offline, with zero runtime overhead incurred. We further demonstrate the advantages of Prune&Comp through an iterative pruning strategy. When integrated with an iterative prune-and-compensate loop, Prune&Comp consistently enhances existing layer pruning metrics. For instance, when 5 layers of LLaMA-3-8B are pruned using the prevalent block influence metric, Prune&Comp nearly halves the perplexity and retains 93.19% of the original model's question-answering performance, outperforming the baseline by 4.01%.
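The abstract describes a two-step recipe: estimate the hidden-state magnitude gap that opens when a layer is removed, then close it by rescaling the remaining weights offline, repeated inside an iterative prune-and-compensate loop. The sketch below illustrates that flow under stated assumptions only; the block-influence-style score, the norm-ratio gap estimate, the Hugging Face LLaMA-style module layout (`model.model.layers`, `o_proj`, `down_proj`), and the choice of which weights absorb the rescale are simplifications for illustration, not the paper's exact formulation.

```python
# Minimal sketch of an iterative prune-and-compensate loop, assuming a Hugging Face
# LLaMA-style causal LM whose decoder stack sits at model.model.layers. The scoring
# metric, gap estimator, and rescaled weights are illustrative assumptions.
import torch


@torch.no_grad()
def block_influence(h_in, h_out):
    """BI-style redundancy score: 1 - mean cosine similarity between a layer's
    input and output hidden states (lower = more redundant)."""
    cos = torch.nn.functional.cosine_similarity(h_in, h_out, dim=-1)
    return 1.0 - cos.mean().item()


@torch.no_grad()
def magnitude_gap(h_expected, h_actual):
    """Hypothetical gap estimate: ratio of the mean hidden-state norm the next
    layer expects to the norm it receives once a layer is removed."""
    return (h_expected.norm(dim=-1).mean() / h_actual.norm(dim=-1).mean()).item()


@torch.no_grad()
def iterative_prune_and_compensate(model, calib_ids, n_prune):
    layers = model.model.layers  # nn.ModuleList of decoder layers
    for _ in range(n_prune):
        # hidden_states[i] is the input of layer i; hidden_states[i + 1] is its output
        hs = model(calib_ids, output_hidden_states=True).hidden_states
        scores = [block_influence(hs[i], hs[i + 1]) for i in range(len(layers))]
        idx = min(range(len(scores)), key=scores.__getitem__)  # most redundant layer
        gap = magnitude_gap(hs[idx + 1], hs[idx])  # magnitude the pruned layer provided
        del layers[idx]  # drop the layer
        if idx < len(layers):
            # Offline compensation: fold the gap into the next kept layer's output
            # projections (a hypothetical choice of weights), adding no runtime cost.
            nxt = layers[idx]
            nxt.self_attn.o_proj.weight.mul_(gap)
            nxt.mlp.down_proj.weight.mul_(gap)
    model.config.num_hidden_layers = len(layers)
    return model
```

Because the rescale is folded into existing weight matrices, the compensated model keeps exactly the same architecture and latency as the plainly pruned one, which is what the abstract means by zero runtime overhead.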
Related papers
- GradMAP: Faster Layer Pruning with Gradient Metric and Projection Compensation [23.236542656505417]
GradMAP is a faster layer pruning method built on a Gradient Metric And Projection compensation, from which its name is derived.
arXiv Detail & Related papers (2026-02-16T11:14:02Z) - Compressing LLMs with MoP: Mixture of Pruners [0.5727968722424193]
MoP (Mixture of Pruners) is an iterative framework for model pruning. It consistently outperforms depth-only and width-only pruning. It translates into real speedup, reducing end-to-end latency by 39% at 40% compression.
arXiv Detail & Related papers (2026-02-05T19:01:06Z) - Gradually Compacting Large Language Models for Reasoning Like a Boiling Frog [72.4168434368873]
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, but their substantial size often demands significant computational resources. We propose a gradual compacting method that divides the compression process into multiple fine-grained iterations. This iterative approach, reminiscent of the "boiling frog" effect, enables the model to be progressively compressed without abrupt performance loss.
arXiv Detail & Related papers (2026-02-04T06:56:52Z) - High-Layer Attention Pruning with Rescaling [14.141903038286362]
Pruning is a highly effective approach for compressing large language models (LLMs). We propose a novel pruning algorithm that strategically prunes attention heads in the model's higher layers. We conduct comprehensive experiments on a wide range of LLMs, including LLaMA3.1-8B, Mistral-7B-v0.3, Qwen2-7B, and Gemma2-9B.
arXiv Detail & Related papers (2025-07-02T17:15:05Z) - A Simple Linear Patch Revives Layer-Pruned Large Language Models [38.25088218910336]
We propose LinearPatch, a plug-and-play technique to revive layer-pruned LLMs. LinearPatch retains up to 94.15% of the original model's performance when pruning 5 layers of LLaMA-3-8B on the question-answering benchmark. With only 5K samples, the retained performance of LinearPatch can be further boosted to 95.16% within 30 minutes on a single computing card.
arXiv Detail & Related papers (2025-05-30T15:06:08Z) - A Sliding Layer Merging Method for Efficient Depth-Wise Pruning in LLMs [13.000188564679998]
This paper reveals the "Patch-like" feature relationship between layers in large language models by analyzing the correlation of the outputs of different layers in the reproducing kernel Hilbert space. We propose a sliding layer merging method that dynamically selects and fuses consecutive layers from top to bottom according to a pre-defined similarity threshold. Our method outperforms existing pruning techniques in both zero-shot inference performance and retraining recovery quality after pruning.
arXiv Detail & Related papers (2025-02-26T14:15:24Z) - A deeper look at depth pruning of LLMs [49.30061112976263]
Large Language Models (LLMs) are not only resource-intensive to train but even more costly to deploy in production.
Recent work has attempted to prune blocks of LLMs based on cheap proxies for estimating block importance.
We show that adaptive metrics exhibit a trade-off in performance between tasks.
arXiv Detail & Related papers (2024-07-23T08:40:27Z) - Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes [68.86687117368247]
We introduce Bonsai, a gradient-free structured pruning method that eliminates the need for backpropagation. Bonsai not only achieves better compression with fewer resources, but also produces models that are twice as fast as those generated by semi-structured pruning. Our results show that removing backprop as a requirement can also lead to state-of-the-art efficiency and performance.
arXiv Detail & Related papers (2024-02-08T04:48:26Z) - LoRAPrune: Structured Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning [56.88751562302793]
Low-rank adaptation (LoRA) has emerged as a way to fine-tune large language models (LLMs).
LoRAPrune is a new framework that delivers an accurate structured pruned model in a highly memory-efficient manner.
LoRAPrune achieves a reduction in perplexity by 4.81 on WikiText2 and 3.46 on PTB, while also decreasing memory usage by 52.6%.
arXiv Detail & Related papers (2023-05-28T15:15:48Z) - Sparse Training via Boosting Pruning Plasticity with Neuroregeneration [79.78184026678659]
We study the effect of pruning throughout training from the perspective of pruning plasticity.
We design a novel gradual magnitude pruning (GMP) method, named gradual pruning with zero-cost neuroregeneration (GraNet), and its dynamic sparse training (DST) variant, GraNet-ST.
Perhaps most impressively, the latter for the first time boosts the sparse-to-sparse training performance over various dense-to-sparse methods by a large margin with ResNet-50 on ImageNet.
arXiv Detail & Related papers (2021-06-19T02:09:25Z) - MLPruning: A Multilevel Structured Pruning Framework for Transformer-based Models [78.45898846056303]
Pruning is an effective method to reduce the memory footprint and computational cost associated with large natural language processing models.
We develop a novel MultiLevel structured Pruning framework, which uses three different levels of structured pruning: head pruning, row pruning, and block-wise sparse pruning.
arXiv Detail & Related papers (2021-05-30T22:00:44Z) - Dynamic Probabilistic Pruning: A general framework for hardware-constrained pruning at different granularities [80.06422693778141]
We propose a flexible new pruning mechanism that facilitates pruning at different granularities (weights, kernels, filters/feature maps). We refer to this algorithm as Dynamic Probabilistic Pruning (DPP).
We show that DPP achieves competitive compression rates and classification accuracy when pruning common deep learning models trained on different benchmark datasets for image classification.
arXiv Detail & Related papers (2021-05-26T17:01:52Z) - Layer Pruning via Fusible Residual Convolutional Block for Deep Neural Networks [15.64167076052513]
Layer pruning yields lower inference time and runtime memory usage when the same FLOPs and number of parameters are pruned. We propose a simple layer pruning method using a residual convolutional block (ResConv). Our pruning method achieves excellent compression and acceleration performance, surpassing the state of the art on different datasets.
arXiv Detail & Related papers (2020-11-29T12:51:16Z) - Layer-adaptive sparsity for the Magnitude-based Pruning [88.37510230946478]
We propose a novel importance score for global pruning, coined the layer-adaptive magnitude-based pruning (LAMP) score.
LAMP consistently outperforms popular existing schemes for layerwise sparsity selection.
arXiv Detail & Related papers (2020-10-15T09:14:02Z)
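As a worked illustration of the LAMP score described in the last entry above, the following sketch divides each squared weight magnitude by the sum of itself and all larger squared magnitudes within the same layer, then applies a single global threshold. It follows the score's published description rather than the authors' official implementation, and restricting the scoring to torch.nn.Linear weights is an assumption made for brevity.

```python
# Minimal LAMP-style scoring and global-threshold pruning sketch (assumption:
# unstructured pruning over Linear weights only; not the authors' official code).
import torch


def lamp_scores(weight: torch.Tensor) -> torch.Tensor:
    """For one layer: square the magnitudes, sort ascending, divide each entry by
    the sum of itself and all larger entries, then restore the original order."""
    flat = weight.detach().flatten() ** 2
    sorted_vals, order = torch.sort(flat)               # ascending magnitudes
    tail_sums = sorted_vals.flip(0).cumsum(0).flip(0)    # sum over v >= u
    scores_sorted = sorted_vals / tail_sums
    scores = torch.empty_like(flat)
    scores[order] = scores_sorted                        # undo the sort
    return scores.view_as(weight)


def global_lamp_prune(model: torch.nn.Module, sparsity: float) -> None:
    """Zero out the fraction `sparsity` of weights with the smallest LAMP scores,
    pooled across all Linear layers (one global threshold, layer-adaptive result)."""
    linears = [m for m in model.modules() if isinstance(m, torch.nn.Linear)]
    all_scores = torch.cat([lamp_scores(m.weight).flatten() for m in linears])
    k = int(sparsity * all_scores.numel())
    threshold = torch.kthvalue(all_scores, k).values if k > 0 else -1.0
    for m in linears:
        mask = lamp_scores(m.weight) > threshold
        m.weight.data.mul_(mask)
```

Note that the largest-magnitude weight in every layer receives a score of exactly 1, so it survives any global threshold below 1; this is what makes the resulting per-layer sparsity adaptive rather than uniform.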
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.