A Simple Linear Patch Revives Layer-Pruned Large Language Models
- URL: http://arxiv.org/abs/2505.24680v2
- Date: Sat, 25 Oct 2025 07:24:08 GMT
- Title: A Simple Linear Patch Revives Layer-Pruned Large Language Models
- Authors: Xinrui Chen, Haoli Bai, Tao Yuan, Ruikang Liu, Kang Zhao, Xianzhi Yu, Lu Hou, Tian Guan, Yonghong He, Chun Yuan
- Abstract summary: Layer pruning has emerged as a widely used technique for compressing large language models (LLMs). LinearPatch fuses two operations into one matrix multiply at the pruning interface. The patch can be further refined with 5K unlabeled samples via memory-efficient offline distillation, pushing the retention to 95.16% within only 30 minutes on a single GPU.
- Score: 58.056251480151104
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Layer pruning has emerged as a widely used technique for compressing large language models (LLMs). However, existing layer pruning approaches often incur substantial performance degradation. We trace the majority of this degradation to a single yet previously overlooked issue: \textit{the mismatch of activation magnitudes at the pruning interface}. The pre-interface activations exhibit significantly different scales from the post-interface ones, causing a distributional shift that propagates through the remaining layers. To address this issue, we introduce \textsc{LinearPatch}, a lightweight and plug-and-play technique that fuses two operations into one matrix multiply at the pruning interface: (i) a Hadamard transformation that suppresses massive outliers at particular tokens and (ii) a channel-wise scaling that aligns activation statistics. On LLaMA-3-8B, \textsc{LinearPatch} preserves up to \textbf{94.15\%} of the original model's performance when pruning 5 out of 32 layers, outperforming the previous state of the art by \textbf{4\%}. The patch can be further refined with 5K unlabeled samples via memory-efficient offline distillation, pushing the retention to 95.16\% within only 30 minutes on a single GPU. Code is available at https://github.com/chenxinrui-tsinghua/LinearPatch.
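The abstract's core idea, fusing a Hadamard rotation and a channel-wise scaling into a single matrix applied at the pruning interface, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `build_linear_patch`, the RMS-based scale estimate, and the rotate, scale per channel, rotate-back composition are all assumptions.

```python
import numpy as np
from scipy.linalg import hadamard


def build_linear_patch(pre_acts, post_acts):
    """Sketch of a LinearPatch-style patch matrix (assumed composition).

    pre_acts, post_acts: (n_samples, d) activations collected just before
    and just after the pruning interface on calibration data; d must be a
    power of two for scipy's Hadamard construction.
    """
    d = pre_acts.shape[1]
    H = hadamard(d) / np.sqrt(d)                # orthonormal Hadamard transform
    pre_rot, post_rot = pre_acts @ H.T, post_acts @ H.T
    # per-channel scale in the rotated basis, aligning RMS statistics
    scale = (np.sqrt((post_rot ** 2).mean(0)) /
             (np.sqrt((pre_rot ** 2).mean(0)) + 1e-8))
    # fuse rotate -> per-channel scale -> rotate back into one matrix
    return H.T @ np.diag(scale) @ H


# toy usage: a synthetic 2x magnitude gap at the pruning interface
rng = np.random.default_rng(0)
pre = rng.normal(size=(128, 64))
post = 2.0 * rng.normal(size=(128, 64))
P = build_linear_patch(pre, post)
patched = pre @ P.T   # one extra matrix multiply at the interface
```

Because the rotation is orthonormal, the whole patch collapses to a single d-by-d matrix that can be merged into the adjacent layer's weights, which is what makes the approach "plug-and-play".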
Related papers
- ShiftLUT: Spatial Shift Enhanced Look-Up Tables for Efficient Image Restoration [8.845117852325997]
ShiftLUT is a novel framework that attains the largest receptive field among all LUT-based methods while maintaining high efficiency. Compared to the previous state-of-the-art method TinyLUT, ShiftLUT achieves a 3.8× larger receptive field and improves average PSNR by over 0.21 dB.
arXiv Detail & Related papers (2026-03-01T04:00:23Z) - E$^3$-Pruner: Towards Efficient, Economical, and Effective Layer Pruning for Large Language Models [24.195465096877196]
Layer pruning is a hardware-friendly approach for model compression. E$^3$-Pruner introduces two key innovations: a differentiable mask optimization method and an entropy-aware adaptive knowledge distillation strategy. It achieves 96% accuracy, a mere 0.8% drop from the original model.
arXiv Detail & Related papers (2025-11-21T12:32:01Z) - Layer as Puzzle Pieces: Compressing Large Language Models through Layer Concatenation [43.822941944402544]
Large Language Models excel at natural language processing tasks, but their massive size leads to high computational and storage demands. Recent works have sought to reduce their model size through layer-wise structured pruning. We re-examine structured pruning paradigms and uncover several key limitations.
arXiv Detail & Related papers (2025-10-17T04:27:06Z) - COMPACT: Common-token Optimized Model Pruning Across Channels and Tokens [8.846552276586918]
Pruning is a promising technique, but existing pruning methods are limited. In this work, we propose COMPACT, which jointly (i) prunes rare vocabulary to shrink embedding/LM head layers and (ii) prunes FFN intermediate channels using common-token-weighted activations. Experiments across Qwen, LLaMA, and Gemma families (0.5B-70B) show state-of-the-art downstream performance, with substantial reductions in parameters, GPU memory, and latency.
arXiv Detail & Related papers (2025-09-08T16:07:06Z) - Prune&Comp: Free Lunch for Layer-Pruned LLMs via Iterative Pruning with Magnitude Compensation [27.807507187324987]
Layer pruning has emerged as a promising technique for compressing large language models (LLMs). In this work, we identify that removing any layer induces a significant magnitude gap in hidden states, resulting in substantial performance degradation. We propose Prune&Comp, a novel plug-and-play layer pruning scheme that mitigates such gaps in a training-free manner.
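The magnitude-gap idea that Prune&Comp shares with the main paper can be illustrated with a toy compensation step. This is not the Prune&Comp algorithm itself; the per-channel mean-absolute statistic and the function name are assumptions for illustration.

```python
import numpy as np


def magnitude_compensation(h_pruned, h_original):
    """Rescale pruned-model hidden states so their per-channel mean
    absolute magnitude matches the original model's at the same depth.
    The scale is a fixed vector, so it could be folded into the next
    layer's weights offline, keeping the scheme training-free."""
    scale = np.abs(h_original).mean(0) / (np.abs(h_pruned).mean(0) + 1e-8)
    return h_pruned * scale


# toy usage with calibration activations of shape (n_tokens, hidden_dim);
# the pruned model's activations are synthetically 3x smaller here
rng = np.random.default_rng(0)
h_orig = 3.0 * rng.normal(size=(256, 16))
h_pruned = rng.normal(size=(256, 16))
h_comp = magnitude_compensation(h_pruned, h_orig)
```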
arXiv Detail & Related papers (2025-07-24T09:07:20Z) - A$^3$: An Analytical Low-Rank Approximation Framework for Attention [14.649496050074735]
We propose A$^3$, a post-training low-rank approximation framework. We show that A$^3$ maintains superior performance compared to state-of-the-art methods. We also demonstrate the versatility of A$^3$, including KV cache compression, quantization, and mixed-rank assignments for enhanced performance.
arXiv Detail & Related papers (2025-05-19T10:29:32Z) - Efficient Token Compression for Vision Transformer with Spatial Information Preserved [59.79302182800274]
Token compression is essential for reducing the computational and memory requirements of transformer models. We propose an efficient and hardware-compatible token compression method called Prune and Merge.
arXiv Detail & Related papers (2025-03-30T14:23:18Z) - A Sliding Layer Merging Method for Efficient Depth-Wise Pruning in LLMs [13.000188564679998]
This paper reveals the "Patch-like" feature relationship between layers in large language models by analyzing the correlation of the outputs of different layers in the reproducing kernel Hilbert space. We propose a sliding layer merging method that dynamically selects and fuses consecutive layers from top to bottom according to a pre-defined similarity threshold. Our method outperforms existing pruning techniques in both zero-shot inference performance and retraining recovery quality after pruning.
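The selection step of such a sliding-window scheme can be sketched as follows. The weight fusion itself is omitted, and the cosine criterion over flattened layer outputs, the threshold value, and all names are assumptions rather than the paper's actual procedure.

```python
import numpy as np


def select_merge_window(layer_outputs, threshold=0.9):
    """Return inclusive (start, end) indices of the longest run of
    consecutive layers whose outputs stay cosine-similar (above the
    threshold) to the first layer of the run; such a run is a
    candidate for merging into a single layer."""
    def cos(a, b):
        a, b = a.ravel(), b.ravel()
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    best = (0, 0)                              # length-1 window by default
    n = len(layer_outputs)
    for start in range(n - 1, -1, -1):         # scan from top (deepest) down
        end = start
        while end + 1 < n and cos(layer_outputs[start],
                                  layer_outputs[end + 1]) >= threshold:
            end += 1
        if end - start > best[1] - best[0]:
            best = (start, end)
    return best


# toy usage: layers 1-3 produce near-identical outputs and thus form
# the candidate window to merge
rng = np.random.default_rng(0)
base = rng.normal(size=(32, 8))
outs = [rng.normal(size=(32, 8)),
        base,
        base + 0.01 * rng.normal(size=(32, 8)),
        base + 0.01 * rng.normal(size=(32, 8)),
        rng.normal(size=(32, 8))]
window = select_merge_window(outs, threshold=0.9)
```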
arXiv Detail & Related papers (2025-02-26T14:15:24Z) - FlexiGPT: Pruning and Extending Large Language Models with Low-Rank Weight Sharing [59.12511498024836]
We present a method to prune large language models (LLMs) that selectively prunes model blocks based on an importance score. We propose a principled metric to replace each pruned block using a weight-sharing mechanism. Empirical evaluations demonstrate substantial performance gains over existing methods.
arXiv Detail & Related papers (2025-01-24T18:46:37Z) - FASP: Fast and Accurate Structured Pruning of Large Language Models [24.185245582500876]
We introduce FASP (Fast and Accurate Structured Pruning), a novel structured pruning framework for large language models (LLMs). FASP employs a distinctive pruning structure that interlinks sequential layers, allowing for the removal of columns in one layer while simultaneously eliminating corresponding rows in the preceding layer without incurring additional performance loss. We evaluate FASP on the OPT and LLaMA model families, demonstrating superior performance in terms of perplexity and accuracy on downstream tasks compared to state-of-the-art methods.
arXiv Detail & Related papers (2025-01-16T09:38:39Z) - A deeper look at depth pruning of LLMs [49.30061112976263]
Large Language Models (LLMs) are resource-intensive to train but even more costly to deploy in production.
Recent work has attempted to prune blocks of LLMs based on cheap proxies for estimating block importance.
We show that adaptive metrics exhibit a trade-off in performance between tasks.
arXiv Detail & Related papers (2024-07-23T08:40:27Z) - ADMM Based Semi-Structured Pattern Pruning Framework For Transformer [4.02487511510606]
This paper introduces an Alternating Direction Method of Multipliers (ADMM) based pattern pruning framework to reshape the distribution of activation maps.
We conduct extensive experiments on classification tasks over the GLUE benchmark.
We achieve a 50% compression ratio while maintaining an overall GLUE score of 80.1.
arXiv Detail & Related papers (2024-07-11T09:35:08Z) - Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning method for Large Language Models.
We learn the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model.
Our method operates for 2.7 hours with around 35GB memory for the 13B models on a single A100 GPU.
arXiv Detail & Related papers (2024-06-15T09:31:03Z) - FinerCut: Finer-grained Interpretable Layer Pruning for Large Language Models [54.787308652357794]
FinerCut is a new form of fine-grained layer pruning for transformer networks.
Our approach retains 90% performance of Llama3-8B with 25% layers removed, and 95% performance of Llama3-70B with 30% layers removed, all without fine-tuning or post-pruning reconstruction.
arXiv Detail & Related papers (2024-05-28T14:21:15Z) - Streamlining Redundant Layers to Compress Large Language Models [21.27944103424621]
This paper introduces LLM-Streamline, a pioneering work on layer pruning for large language models (LLMs). It is based on the observation that different layers have varying impacts on hidden states, enabling the identification of less important layers to be pruned. Experiments show that LLM-Streamline outperforms both previous and concurrent state-of-the-art pruning methods in terms of both performance and training efficiency.
arXiv Detail & Related papers (2024-03-28T04:12:13Z) - BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation [54.28841287750586]
Large language models (LLMs) have demonstrated outstanding performance in various tasks, such as text summarization and text question-answering.
Existing solutions such as SparseGPT and Wanda attempt to alleviate this issue through weight pruning.
This paper introduces a novel LLM pruning technique dubbed blockwise parameter-efficient sparsity allocation (BESA) by applying a blockwise reconstruction loss.
arXiv Detail & Related papers (2024-02-18T12:44:15Z) - Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes [68.86687117368247]
We introduce Bonsai, a gradient-free structured pruning method that eliminates the need for backpropagation. Bonsai not only achieves better compression with fewer resources, but also produces models that are twice as fast as those generated by semi-structured pruning. Our results show that removing backprop as a requirement can also lead to state-of-the-art efficiency and performance.
arXiv Detail & Related papers (2024-02-08T04:48:26Z) - From PEFT to DEFT: Parameter Efficient Finetuning for Reducing Activation Density in Transformers [52.199303258423306]
We propose a novel density loss that encourages higher activation sparsity in pre-trained models.
Our proposed method, DEFT, can consistently reduce activation density by up to 44.94% on RoBERTa-Large and by 53.19% (encoder density) and 90.60% (decoder density) on Flan-T5-XXL.
arXiv Detail & Related papers (2024-02-02T21:25:46Z) - DiffRate: Differentiable Compression Rate for Efficient Vision Transformers [98.33906104846386]
Token compression aims to speed up large-scale vision transformers (e.g. ViTs) by pruning (dropping) or merging tokens.
DiffRate is a novel token compression method that has several appealing properties that prior art does not have.
arXiv Detail & Related papers (2023-05-29T10:15:19Z)