Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity
- URL: http://arxiv.org/abs/2310.05175v3
- Date: Mon, 6 May 2024 07:36:08 GMT
- Title: Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity
- Authors: Lu Yin, You Wu, Zhenyu Zhang, Cheng-Yu Hsieh, Yaqing Wang, Yiling Jia, Gen Li, Ajay Jaiswal, Mykola Pechenizkiy, Yi Liang, Michael Bendersky, Zhangyang Wang, Shiwei Liu
- Abstract summary: Large Language Models (LLMs) are renowned for their remarkable performance across diverse domains.
We introduce a novel LLM pruning methodology that incorporates a tailored set of non-uniform layerwise sparsity ratios, termed Outlier Weighed Layerwise sparsity (OWL).
OWL exhibits a remarkable performance gain, surpassing the state-of-the-art Wanda and SparseGPT by 61.22 and 6.80 perplexity at a high sparsity level of 70%, respectively.
- Score: 88.62935593360162
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large Language Models (LLMs), renowned for their remarkable performance across diverse domains, present a challenge when it comes to practical deployment due to their colossal model size. In response to this challenge, efforts have been directed toward the application of traditional network pruning techniques to LLMs, uncovering a massive number of parameters that can be pruned in one-shot without hurting performance. Prevailing LLM pruning strategies have consistently adhered to the practice of uniformly pruning all layers at equivalent sparsity, resulting in robust performance. However, this observation stands in contrast to the prevailing trends observed in the field of vision models, where non-uniform layerwise sparsity typically yields stronger results. To understand the underlying reasons for this disparity, we conduct a comprehensive study and discover a strong correlation with the emergence of activation outliers in LLMs. Inspired by this finding, we introduce a novel LLM pruning methodology that incorporates a tailored set of non-uniform layerwise sparsity ratios, termed as Outlier Weighed Layerwise sparsity (OWL). The sparsity ratio of OWL is proportional to the outlier ratio observed within each layer, facilitating a more effective alignment between layerwise weight sparsity and outlier ratios. Our empirical evaluation, conducted across the LLaMA-V1 family and OPT, spanning various benchmarks, demonstrates the distinct advantages offered by OWL over previous methods. For instance, OWL exhibits a remarkable performance gain, surpassing the state-of-the-art Wanda and SparseGPT by 61.22 and 6.80 perplexity at a high sparsity level of 70%, respectively, while delivering 2.6x end-to-end inference speed-up in the DeepSparse inference engine. Codes are available at https://github.com/luuyin/OWL.
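Below is a minimal sketch of the layerwise sparsity allocation the abstract describes, assuming a Wanda-style outlier metric (|weight| x input-activation norm) and an outlier threshold `m`; the function names, `m`, and the normalization are illustrative assumptions, not the reference implementation at https://github.com/luuyin/OWL.

```python
import torch

def layer_outlier_ratio(weight: torch.Tensor, act_norm: torch.Tensor, m: float = 5.0) -> float:
    """Fraction of weights whose activation-scaled magnitude exceeds m times the layer mean.

    weight:   [out_features, in_features] layer weight
    act_norm: [in_features] norm of the corresponding input activations (Wanda-style)
    """
    score = weight.abs() * act_norm            # outlier score per weight
    return (score > m * score.mean()).float().mean().item()

def owl_layer_sparsities(outlier_ratios, target: float = 0.7, lam: float = 0.08):
    """Allocate per-layer sparsity around a global target: layers with more outliers
    are pruned less, and every ratio stays within [target - lam, target + lam]."""
    d = torch.tensor(outlier_ratios, dtype=torch.float32)
    d = (d - d.mean()) / (d.max() - d.min() + 1e-8)     # zero-mean, range-normalized
    s = target - 2 * lam * d                            # more outliers -> lower sparsity
    return s.clamp(target - lam, target + lam).tolist()
```

A pruning loop would then prune each layer to its allocated ratio with its usual one-shot criterion (e.g., magnitude, Wanda, or SparseGPT scores).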
Related papers
- MoE-I$^2$: Compressing Mixture of Experts Models through Inter-Expert Pruning and Intra-Expert Low-Rank Decomposition [32.97035551579975]
We introduce a two-stage compression method tailored for MoE to reduce the model size and decrease the computational cost.
Experiments on Qwen1.5-MoE-A2.7B, DeepSeek-V2-Lite, and Mixtral-8$\times$7B demonstrate that our proposed methods can both reduce the model size and enhance inference efficiency.
arXiv Detail & Related papers (2024-11-01T20:37:58Z)
- AlphaPruning: Using Heavy-Tailed Self Regularization Theory for Improved Layer-wise Pruning of Large Language Models [94.82766517752418]
We propose AlphaPruning, which uses shape metrics to allocate layerwise sparsity ratios in a more theoretically principled manner.
Our results show that AlphaPruning prunes LLaMA-7B to 80% sparsity while maintaining reasonable perplexity, marking a first in the literature on LLMs.
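A hedged sketch of the general heavy-tailed "shape metric" idea (not AlphaPruning's exact recipe): estimate a power-law exponent for each layer's weight spectrum with a Hill estimator and map it linearly to a per-layer sparsity; the estimator, the mapping, and the constants are assumptions for illustration.

```python
import torch

def spectral_alpha(weight: torch.Tensor, k: int = 50) -> float:
    """Rough power-law exponent of the eigenvalue spectrum of W^T W (Hill estimator).
    Assumes the layer has more than k singular values."""
    eig = torch.linalg.svdvals(weight.float()) ** 2
    top = torch.sort(eig, descending=True).values[: k + 1]
    return 1.0 + k / torch.log(top[:k] / top[k]).sum().item()

def alphas_to_sparsities(alphas, target: float = 0.7, lam: float = 0.1):
    """Heavier-tailed layers (smaller alpha) get lower sparsity; the mean stays near target."""
    a = torch.tensor(alphas, dtype=torch.float32)
    a = (a - a.mean()) / (a.max() - a.min() + 1e-8)
    return (target + 2 * lam * a).clamp(target - lam, target + lam).tolist()
```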
arXiv Detail & Related papers (2024-10-14T03:35:11Z)
- Can LLMs predict the convergence of Stochastic Gradient Descent? [5.206475868803433]
Large randomized models are well known for their impressive performance across a wide range of tasks.
One surprising example of such performance is a recently identified class of tasks satisfying the principles of Markovian systems.
arXiv Detail & Related papers (2024-08-03T10:35:59Z)
- SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models [53.638791265113625]
We propose SPP, a Sparsity-Preserved Parameter-efficient fine-tuning method for large language models.
Code will be made available at https://github.com/Lucky-Lance/SPP.
arXiv Detail & Related papers (2024-05-25T04:55:27Z)
- LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning [31.088229461632206]
The massive memory consumption of large language models (LLMs) has become a significant roadblock to large-scale training.
Low-Rank Adaptation (LoRA) has been proposed to alleviate this problem.
We investigate the layerwise properties of LoRA on fine-tuning tasks and observe an unexpected but consistent skewness of weight norms.
Inspired by this observation, we propose Layerwise Importance Sampled AdamW (LISA), a promising alternative to LoRA.
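As a rough illustration of layerwise sampling during fine-tuning (the sampling weights, count, and refresh period below are assumptions, not the paper's settings), one can periodically unfreeze only a small, randomly chosen subset of transformer blocks:

```python
import random
import torch.nn as nn

def resample_trainable_layers(blocks: nn.ModuleList, n_active: int = 2, weights=None) -> None:
    """Freeze every block, then unfreeze up to n_active blocks sampled
    (optionally with per-layer importance weights)."""
    for blk in blocks:
        blk.requires_grad_(False)
    picked = random.choices(range(len(blocks)), weights=weights, k=n_active)
    for i in set(picked):                 # choices samples with replacement; dedupe
        blocks[i].requires_grad_(True)

# Usage sketch: refresh the active subset every K optimizer steps, e.g.
#   if step % K == 0:
#       resample_trainable_layers(model.model.layers, n_active=2)   # attribute path is hypothetical
```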
arXiv Detail & Related papers (2024-03-26T17:55:02Z)
- Why Lift so Heavy? Slimming Large Language Models by Cutting Off the Layers [2.1165011830664673]
Large Language Models (LLMs) possess outstanding capabilities in addressing various natural language processing (NLP) tasks.
The sheer size of these models poses challenges in terms of storage, training and inference due to the inclusion of billions of parameters through layer stacking.
We show that even with fewer layers, LLMs maintain similar or better performance levels, particularly in prompt-based fine-tuning for text classification tasks.
arXiv Detail & Related papers (2024-02-18T20:47:10Z) - PRILoRA: Pruned and Rank-Increasing Low-Rank Adaptation [65.268245109828]
We introduce PRILoRA, which linearly allocates an increasing rank to each successive layer and performs pruning throughout the training process.
We validate the effectiveness of PRILoRA through extensive experiments on eight GLUE benchmarks, setting a new state of the art.
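A tiny sketch of the linearly increasing rank allocation idea; the span and rounding scheme are illustrative assumptions.

```python
def linear_rank_schedule(num_layers: int, avg_rank: int = 8, span: int = 6):
    """Per-layer LoRA ranks that rise linearly from avg_rank - span/2 to avg_rank + span/2.
    Assumes num_layers > 1."""
    lo, hi = avg_rank - span / 2, avg_rank + span / 2
    return [max(1, round(lo + (hi - lo) * i / (num_layers - 1))) for i in range(num_layers)]

# e.g. linear_rank_schedule(12) -> [5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11]
```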
arXiv Detail & Related papers (2024-01-20T20:25:17Z) - ECoFLaP: Efficient Coarse-to-Fine Layer-Wise Pruning for Vision-Language
Models [70.45441031021291]
Large Vision-Language Models (LVLMs) can understand the world comprehensively by integrating rich information from different modalities.
However, LVLMs are often impractical to deploy due to their massive computational and energy costs and the resulting carbon footprint.
We propose Efficient Coarse-to-Fine LayerWise Pruning (ECoFLaP), a two-stage coarse-to-fine weight pruning approach for LVLMs.
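A hedged two-stage sketch of the coarse-to-fine idea; the layer importance score here is plain weight magnitude, a stand-in assumption for the paper's global score.

```python
import torch

def coarse_allocate(layer_importances, target: float = 0.5, lam: float = 0.1):
    """Coarse stage: turn per-layer importance scores into per-layer sparsity ratios
    (more important layers are pruned less; the average stays near the target)."""
    imp = torch.tensor(layer_importances, dtype=torch.float32)
    imp = (imp - imp.mean()) / (imp.max() - imp.min() + 1e-8)
    return (target - 2 * lam * imp).clamp(target - lam, target + lam).tolist()

def fine_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Fine stage: zero out the smallest-magnitude fraction of weights inside one layer."""
    k = int(sparsity * weight.numel())
    if k == 0:
        return weight
    thresh = weight.abs().flatten().kthvalue(k).values
    return weight * (weight.abs() > thresh)
```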
arXiv Detail & Related papers (2023-10-04T17:34:00Z) - DoLa: Decoding by Contrasting Layers Improves Factuality in Large
Language Models [79.01926242857613]
Large language models (LLMs) are prone to hallucinations, generating content that deviates from facts seen during pretraining.
We propose a simple decoding strategy for reducing hallucinations with pretrained LLMs.
We find that this Decoding by Contrasting Layers (DoLa) approach is able to better surface factual knowledge and reduce the generation of incorrect facts.
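As a rough sketch of layer-contrast decoding (the plausibility cutoff `alpha` and the way early-layer logits are obtained are assumptions, not the paper's exact procedure):

```python
import torch
import torch.nn.functional as F

def contrast_layers(final_logits: torch.Tensor, early_logits: torch.Tensor, alpha: float = 0.1):
    """Score next tokens by the gap between the final layer's log-probs and an earlier
    layer's log-probs, keeping only tokens the final layer itself finds plausible.

    final_logits, early_logits: [vocab_size] logits from the last and an earlier layer's LM head.
    """
    logp_final = F.log_softmax(final_logits, dim=-1)
    logp_early = F.log_softmax(early_logits, dim=-1)
    scores = logp_final - logp_early                       # emphasize knowledge added by later layers
    cutoff = torch.log(torch.tensor(alpha)) + logp_final.max()
    return scores.masked_fill(logp_final < cutoff, float("-inf"))
```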
arXiv Detail & Related papers (2023-09-07T17:45:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.