Maximum Redundancy Pruning: A Principle-Driven Layerwise Sparsity Allocation for LLMs
- URL: http://arxiv.org/abs/2503.18377v1
- Date: Mon, 24 Mar 2025 06:17:30 GMT
- Title: Maximum Redundancy Pruning: A Principle-Driven Layerwise Sparsity Allocation for LLMs
- Authors: Chang Gao, Kang Zhao, Jianfei Chen, Liping Jing,
- Abstract summary: Large language models (LLMs) have demonstrated impressive capabilities, but their enormous size poses significant challenges for deployment in real-world applications.<n>Recent sparsity allocation methods is often based on iteration or search that can easily lead to suboptimal performance.<n>We propose Redundancy Pruning (MRP), an iterative pruning algorithm that prunes in the most redundant layers.
- Score: 24.23702494859769
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have demonstrated impressive capabilities, but their enormous size poses significant challenges for deployment in real-world applications. To address this issue, researchers have sought to apply network pruning techniques to LLMs. A critical challenge in pruning is allocation the sparsity for each layer. Recent sparsity allocation methods is often based on heuristics or search that can easily lead to suboptimal performance. In this paper, we conducted an extensive investigation into various LLMs and revealed three significant discoveries: (1) the layerwise pruning sensitivity (LPS) of LLMs is highly non-uniform, (2) the choice of pruning metric affects LPS, and (3) the performance of a sparse model is related to the uniformity of its layerwise redundancy level. Based on these observations, we propose that the layerwise sparsity of LLMs should adhere to three principles: \emph{non-uniformity}, \emph{pruning metric dependency}, and \emph{uniform layerwise redundancy level} in the pruned model. To this end, we proposed Maximum Redundancy Pruning (MRP), an iterative pruning algorithm that prunes in the most redundant layers (\emph{i.e.}, those with the highest non-outlier ratio) at each iteration. The achieved layerwise sparsity aligns with the outlined principles. We conducted extensive experiments on publicly available LLMs, including the LLaMA2 and OPT, across various benchmarks. Experimental results validate the effectiveness of MRP, demonstrating its superiority over previous methods.
Related papers
- Towards Extreme Pruning of LLMs with Plug-and-Play Mixed Sparsity [32.668409666483626]
Existing pruning methods mainly focus on designing metrics to measure the importance of network components to guide pruning.
We propose an efficient method based on the trace of Fisher Information Matrix (FIM) to quantitatively measure and verify the different sensitivities across layers.
Based on this, we propose Mixed Sparsity Pruning (MSP) which uses a pruning-oriented evolutionary algorithm (EA) to determine the optimal sparsity levels for different layers.
arXiv Detail & Related papers (2025-03-14T08:05:49Z) - Determining Layer-wise Sparsity for Large Language Models Through a Theoretical Perspective [55.90119819642064]
We address the challenge of determining the layer-wise sparsity rates of large language models (LLMs) through a theoretical perspective.
This refers to the cumulative effect of reconstruction errors throughout the sparsification process.
We derive a simple yet effective approach to layer-wise sparsity allocation that mitigates this issue.
arXiv Detail & Related papers (2025-02-20T17:51:10Z) - Universal Model Routing for Efficient LLM Inference [72.65083061619752]
We consider the problem of dynamic routing, where new, previously unobserved LLMs are available at test time.
We propose a new approach to this problem that relies on representing each LLM as a feature vector, derived based on predictions on a set of representative prompts.
We prove that these strategies are estimates of a theoretically optimal routing rule, and provide an excess risk bound to quantify their errors.
arXiv Detail & Related papers (2025-02-12T20:30:28Z) - Adaptive Pruning for Large Language Models with Structural Importance Awareness [66.2690963378878]
Large language models (LLMs) have significantly improved language understanding and generation capabilities.<n>LLMs are difficult to deploy on resource-constrained edge devices due to their high computational and storage resource demands.<n>We propose structurally-aware adaptive pruning (SAAP) to significantly reduce the computational and memory costs while maintaining model performance.
arXiv Detail & Related papers (2024-12-19T18:08:04Z) - CogSteer: Cognition-Inspired Selective Layer Intervention for Efficiently Steering Large Language Models [37.476241509187304]
Large Language Models (LLMs) achieve remarkable performance through pretraining on extensive data.
The lack of interpretability in their underlying mechanisms limits the ability to effectively steer LLMs for specific applications.
In this work, we investigate the mechanisms of LLMs from a cognitive perspective using eye movement measures.
arXiv Detail & Related papers (2024-10-23T09:40:15Z) - AlphaPruning: Using Heavy-Tailed Self Regularization Theory for Improved Layer-wise Pruning of Large Language Models [94.82766517752418]
We propose AlphaPruning, which uses shape metrics to allocate layerwise sparsity ratios in a more theoretically principled manner.
Our results show that AlphaPruning prunes LLaMA-7B to 80% sparsity while maintaining reasonable perplexity, marking a first in the literature on LLMs.
arXiv Detail & Related papers (2024-10-14T03:35:11Z) - Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning on Large-Language Models.
We learn the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model.
Our method operates for 2.7 hours with around 35GB memory for the 13B models on a single A100 GPU.
arXiv Detail & Related papers (2024-06-15T09:31:03Z) - ShortGPT: Layers in Large Language Models are More Redundant Than You Expect [38.148626520751385]
We show that many layers of Large Language Models (LLMs) exhibit high similarity, and some layers play a negligible role in network functionality.
We propose a straightforward pruning approach: layer removal, in which we directly delete the redundant layers.
Experiments demonstrate that our method, which we call ShortGPT, significantly outperforms previous state-of-the-art (SOTA) methods in model pruning.
arXiv Detail & Related papers (2024-03-06T17:04:18Z) - Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity [88.62935593360162]
Large Language Models (LLMs) are renowned for their remarkable performance across diverse domains.
We introduce a novel LLM pruning methodology that incorporates a tailored set of non-uniform layerwise sparsity ratios, termed as Outlier Weighed Layerwise sparsity (OWL)
OWL exhibits a remarkable performance gain, surpassing the state-of-the-art Wanda and SparseGPT by 61.22 and 6.80 perplexity at a high sparsity level of 70%, respectively.
arXiv Detail & Related papers (2023-10-08T14:22:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.