AlphaPruning: Using Heavy-Tailed Self Regularization Theory for Improved Layer-wise Pruning of Large Language Models
- URL: http://arxiv.org/abs/2410.10912v1
- Date: Mon, 14 Oct 2024 03:35:11 GMT
- Title: AlphaPruning: Using Heavy-Tailed Self Regularization Theory for Improved Layer-wise Pruning of Large Language Models
- Authors: Haiquan Lu, Yefan Zhou, Shiwei Liu, Zhangyang Wang, Michael W. Mahoney, Yaoqing Yang
- Abstract summary: We propose AlphaPruning, which uses shape metrics to allocate layerwise sparsity ratios in a more theoretically principled manner.
Our results show that AlphaPruning prunes LLaMA-7B to 80% sparsity while maintaining reasonable perplexity, marking a first in the literature on LLMs.
- Score: 94.82766517752418
- License:
- Abstract: Recent work on pruning large language models (LLMs) has shown that one can eliminate a large number of parameters without compromising performance, making pruning a promising strategy to reduce LLM model size. Existing LLM pruning strategies typically assign uniform pruning ratios across layers, limiting overall pruning ability; and recent work on layerwise pruning of LLMs is often based on heuristics that can easily lead to suboptimal performance. In this paper, we leverage Heavy-Tailed Self-Regularization (HT-SR) Theory, in particular the shape of empirical spectral densities (ESDs) of weight matrices, to design improved layerwise pruning ratios for LLMs. Our analysis reveals a wide variability in how well-trained, and thus relatedly how prunable, different layers of an LLM are. Based on this, we propose AlphaPruning, which uses shape metrics to allocate layerwise sparsity ratios in a more theoretically principled manner. AlphaPruning can be used in conjunction with multiple existing LLM pruning methods. Our empirical results show that AlphaPruning prunes LLaMA-7B to 80% sparsity while maintaining reasonable perplexity, marking a first in the literature on LLMs. We have open-sourced our code at https://github.com/haiquanlu/AlphaPruning.
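To make the allocation idea concrete, here is a minimal, self-contained sketch (in PyTorch) of the recipe described in the abstract: estimate a heavy-tailed shape metric (a Hill-style power-law exponent "alpha") from each layer's empirical spectral density, then map the per-layer alphas to non-uniform layerwise sparsity ratios. The helper names (hill_alpha, allocate_sparsity), the choice of tail fraction, and the linear alpha-to-sparsity mapping are illustrative assumptions, not the authors' implementation; see the linked repository for the actual method.
```python
# Illustrative sketch of the AlphaPruning idea (not the authors' code):
# 1) compute the empirical spectral density (ESD) of each weight matrix,
# 2) estimate a heavy-tail shape metric (Hill-style power-law exponent "alpha"),
# 3) map the per-layer alphas linearly to layerwise sparsity ratios.
import torch


def hill_alpha(weight: torch.Tensor, k_frac: float = 0.1) -> float:
    """Hill-style tail-exponent estimate from the ESD of a weight matrix."""
    # Eigenvalues of W W^T are the squared singular values of W.
    eigs = torch.linalg.svdvals(weight.float()) ** 2
    eigs, _ = torch.sort(eigs, descending=True)
    k = max(int(k_frac * eigs.numel()), 2)          # size of the tail used for the fit
    k = min(k, eigs.numel() - 1)
    tail, ref = eigs[:k], eigs[k]
    return 1.0 + k / torch.log(tail / ref).sum().item()


def allocate_sparsity(alphas, target=0.7, spread=0.2):
    """Linearly map alphas to per-layer sparsities centered on the target.

    Heuristic assumption for this sketch: layers with larger alpha (lighter
    spectral tails, i.e. less well-trained under HT-SR) tolerate more pruning.
    """
    a = torch.tensor(alphas)
    a = (a - a.min()) / (a.max() - a.min() + 1e-8)  # normalize alphas to [0, 1]
    s = target - spread / 2 + spread * a            # mean sparsity stays near target
    return s.clamp(0.0, 0.99).tolist()


# Toy usage: `weights` stands in for one 2-D weight matrix per prunable layer.
weights = {f"layer_{i}": torch.randn(256, 256) for i in range(4)}
alphas = [hill_alpha(w) for w in weights.values()]
for name, ratio in zip(weights, allocate_sparsity(alphas, target=0.7)):
    print(f"{name}: sparsity {ratio:.2f}")
```
The per-layer ratios produced this way can then be handed to any unstructured pruning criterion (e.g. magnitude-based) in place of a uniform ratio, which is how the abstract describes combining AlphaPruning with existing LLM pruning methods.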
Related papers
- Reassessing Layer Pruning in LLMs: New Insights and Methods [24.394438652261982]
We show that a simple approach, i.e., pruning the final 25% of layers followed by fine-tuning the lm_head and the last three remaining layers, yields remarkably strong performance.
We release the optimal model weights on Hugging Face, and the code is available on GitHub.
arXiv Detail & Related papers (2024-11-23T13:31:16Z)
- Pruning Foundation Models for High Accuracy without Retraining [48.256389781305415]
It is challenging to deploy foundation models or large language models (LLMs) due to their massive parameters and computations.
Post-training pruning methods are proposed to prune LLMs in one shot without retraining.
Our experiments demonstrate the superior performance of the proposed methods in comparison to SOTA baselines.
arXiv Detail & Related papers (2024-10-21T01:23:34Z)
- Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning method for large language models.
We learn the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model.
Our method runs in 2.7 hours using around 35GB of memory for 13B models on a single A100 GPU.
arXiv Detail & Related papers (2024-06-15T09:31:03Z)
- SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models [53.638791265113625]
SPP is a sparsity-preserved and parameter-efficient fine-tuning method for large language models.
Code will be made available at https://github.com/Lucky-Lance/SPP.
arXiv Detail & Related papers (2024-05-25T04:55:27Z)
- ShortGPT: Layers in Large Language Models are More Redundant Than You Expect [38.148626520751385]
We show that many layers of Large Language Models (LLMs) exhibit high similarity, and some layers play a negligible role in network functionality.
We propose a straightforward pruning approach: layer removal, in which we directly delete the redundant layers.
Experiments demonstrate that our method, which we call ShortGPT, significantly outperforms previous state-of-the-art (SOTA) methods in model pruning (a similarity-based redundancy-scoring sketch appears after this list).
arXiv Detail & Related papers (2024-03-06T17:04:18Z)
- Shortened LLaMA: Depth Pruning for Large Language Models with Comparison of Retraining Methods [5.135352292810664]
We show that simple depth pruning can effectively compress large language models (LLMs).
Our pruning method boosts inference speeds, especially under memory-constrained conditions.
We hope this work can help build compact yet capable LLMs.
arXiv Detail & Related papers (2024-02-05T09:44:49Z)
- One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models [42.95555008229016]
We propose a method based on Hessian sensitivity-aware mixed sparsity pruning that prunes LLMs to at least 50% sparsity without any retraining.
The advantages of the proposed method become even more pronounced at extremely high sparsity levels.
arXiv Detail & Related papers (2023-10-14T05:43:09Z)
- Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs [67.38165028487242]
We introduce Dynamic Sparse No Training (DSnoT), a training-free fine-tuning approach for sparse large language models (LLMs).
Inspired by Dynamic Sparse Training, DSnoT minimizes the reconstruction error between the dense and sparse LLMs.
Our paper offers fresh insights into how to fine-tune sparse LLMs in an efficient, training-free manner and opens new avenues for scaling the potential of sparsity in LLMs.
arXiv Detail & Related papers (2023-10-13T07:38:52Z)
- Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity [88.62935593360162]
Large Language Models (LLMs) are renowned for their remarkable performance across diverse domains.
We introduce a novel LLM pruning methodology that incorporates a tailored set of non-uniform layerwise sparsity ratios, termed Outlier Weighed Layerwise sparsity (OWL).
OWL exhibits a remarkable performance gain, surpassing the state-of-the-art Wanda and SparseGPT by 61.22 and 6.80 perplexity at a high sparsity level of 70%, respectively.
arXiv Detail & Related papers (2023-10-08T14:22:58Z)
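Several of the entries above decide what to prune from simple per-layer statistics. As a rough illustration of the layer-removal idea behind ShortGPT, the sketch below scores each layer by the cosine similarity between its input and output hidden states and flags the most "pass-through" layers as removal candidates. It assumes per-layer hidden states are available (e.g. via output_hidden_states=True in Hugging Face transformers); the function name and the selection rule are illustrative, not taken from that paper.
```python
# Illustrative layer-redundancy scoring in the spirit of layer-removal methods.
import torch
import torch.nn.functional as F


def layer_redundancy(hidden_states):
    """Score each layer by how little it changes its input representation.

    hidden_states: list of (batch, seq, dim) tensors at each layer boundary,
    length = num_layers + 1. A score near 1.0 means the layer barely changes
    its input, making it a candidate for removal.
    """
    scores = []
    for h_in, h_out in zip(hidden_states[:-1], hidden_states[1:]):
        sim = F.cosine_similarity(h_in, h_out, dim=-1)  # per-token similarity
        scores.append(sim.mean().item())
    return scores


# Toy usage with random activations standing in for a real forward pass.
hs = [torch.randn(1, 16, 64) for _ in range(9)]          # 8 "layers"
scores = layer_redundancy(hs)
most_redundant = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:2]
print("candidate layers for removal:", most_redundant)
```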
This list is automatically generated from the titles and abstracts of the papers on this site.