Toward Adaptive Large Language Models Structured Pruning via Hybrid-grained Weight Importance Assessment
- URL: http://arxiv.org/abs/2403.10799v5
- Date: Sun, 12 Jan 2025 06:47:39 GMT
- Title: Toward Adaptive Large Language Models Structured Pruning via Hybrid-grained Weight Importance Assessment
- Authors: Jun Liu, Zhenglun Kong, Pu Zhao, Changdi Yang, Hao Tang, Xuan Shen, Geng Yuan, Wei Niu, Wenbin Zhang, Xue Lin, Dong Huang, Yanzhi Wang,
- Abstract summary: We introduce the Hybrid-grained Weight Importance Assessment (HyWIA), a novel method that merges fine-grained and coarse-grained evaluations of weight importance for the pruning of large language models (LLMs)<n>Extensive experiments on LLaMA-V1/V2, Vicuna, Baichuan, and Bloom across various benchmarks demonstrate the effectiveness of HyWIA in pruning LLMs.
- Score: 58.030196381554745
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Structured pruning for large language models (LLMs) has garnered significant academic interest due to its ability to efficiently compress and accelerate LLMs by eliminating redundant weight groups at a coarse-grained granularity. Current structured pruning methods for LLMs typically depend on a singular granularity for assessing weight importance, resulting in notable performance degradation in downstream tasks. Intriguingly, our empirical investigations reveal that utilizing unstructured pruning, which achieves better performance retention by pruning weights at a finer granularity, \emph{i.e.}, individual weights, yields significantly varied sparse LLM structures when juxtaposed to structured pruning. This suggests that evaluating both holistic and individual assessment for weight importance is essential for LLM pruning. Building on this insight, we introduce the Hybrid-grained Weight Importance Assessment (HyWIA), a novel method that merges fine-grained and coarse-grained evaluations of weight importance for the pruning of LLMs. Leveraging an attention mechanism, HyWIA adaptively determines the optimal blend of granularity in weight importance assessments in an end-to-end pruning manner. Extensive experiments on LLaMA-V1/V2, Vicuna, Baichuan, and Bloom across various benchmarks demonstrate the effectiveness of HyWIA in pruning LLMs. For example, HyWIA surpasses the cutting-edge LLM-Pruner by an average margin of 2.82% in accuracy across seven downstream tasks when pruning LLaMA-7B by 50%.
Related papers
- The Other Side of the Coin: Exploring Fairness in Retrieval-Augmented Generation [73.16564415490113]
Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by retrieving relevant document from external knowledge sources.
We propose two approaches, FairFT and FairFilter, to mitigate the fairness issues introduced by RAG for small-scale LLMs.
arXiv Detail & Related papers (2025-04-11T10:17:10Z) - Towards Extreme Pruning of LLMs with Plug-and-Play Mixed Sparsity [32.668409666483626]
Existing pruning methods mainly focus on designing metrics to measure the importance of network components to guide pruning.
We propose an efficient method based on the trace of Fisher Information Matrix (FIM) to quantitatively measure and verify the different sensitivities across layers.
Based on this, we propose Mixed Sparsity Pruning (MSP) which uses a pruning-oriented evolutionary algorithm (EA) to determine the optimal sparsity levels for different layers.
arXiv Detail & Related papers (2025-03-14T08:05:49Z) - Adaptive Pruning for Large Language Models with Structural Importance Awareness [66.2690963378878]
Large language models (LLMs) have significantly improved language understanding and generation capabilities.
LLMs are difficult to deploy on resource-constrained edge devices due to their high computational and storage resource demands.
We propose structurally-aware adaptive pruning (SAAP) to significantly reduce the computational and memory costs while maintaining model performance.
arXiv Detail & Related papers (2024-12-19T18:08:04Z) - Densing Law of LLMs [81.06644243978101]
Large Language Models (LLMs) have emerged as a milestone in artificial intelligence, and their performance can improve as the model size increases.
This paper introduces the concept of textitcapacity density'' as a new metric to evaluate the quality of the LLMs across different scales.
arXiv Detail & Related papers (2024-12-05T16:31:13Z) - A deeper look at depth pruning of LLMs [49.30061112976263]
Large Language Models (LLMs) are resource-intensive to train but more costly to deploy in production.
Recent work has attempted to prune blocks of LLMs based on cheap proxies for estimating block importance.
We show that adaptive metrics exhibit a trade-off in performance between tasks.
arXiv Detail & Related papers (2024-07-23T08:40:27Z) - Implicit Generative Prior for Bayesian Neural Networks [8.013264410621357]
We propose a novel neural adaptive empirical Bayes (NA-EB) framework for complex data structures.
The proposed NA-EB framework combines variational inference with a gradient ascent algorithm.
We demonstrate the practical applications of our framework through extensive evaluations on a variety of tasks.
arXiv Detail & Related papers (2024-04-27T21:00:38Z) - Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity [88.62935593360162]
Large Language Models (LLMs) are renowned for their remarkable performance across diverse domains.
We introduce a novel LLM pruning methodology that incorporates a tailored set of non-uniform layerwise sparsity ratios, termed as Outlier Weighed Layerwise sparsity (OWL)
OWL exhibits a remarkable performance gain, surpassing the state-of-the-art Wanda and SparseGPT by 61.22 and 6.80 perplexity at a high sparsity level of 70%, respectively.
arXiv Detail & Related papers (2023-10-08T14:22:58Z) - Compresso: Structured Pruning with Collaborative Prompting Learns
Compact Large Language Models [15.471290825100075]
We introduce a new paradigm for structurally pruning Large Language Models, called Compresso.
Our approach, through the collaboration of the proposed resource-efficient pruning algorithm and the LLM itself, learns optimal pruning decisions during the training process.
In experiments, Compresso significantly outperforms one-shot pruning baselines across various sparsity ratios, achieving up to 2.21%, 11.43%, 7.04%, and 4.81% higher scores on the commonsense reasoning, reading comprehension, MMLU, and BBH benchmarks, respectively.
arXiv Detail & Related papers (2023-10-08T05:16:28Z) - ECoFLaP: Efficient Coarse-to-Fine Layer-Wise Pruning for Vision-Language
Models [70.45441031021291]
Large Vision-Language Models (LVLMs) can understand the world comprehensively by integrating rich information from different modalities.
LVLMs are often problematic due to their massive computational/energy costs and carbon consumption.
We propose Efficient Coarse-to-Fine LayerWise Pruning (ECoFLaP), a two-stage coarse-to-fine weight pruning approach for LVLMs.
arXiv Detail & Related papers (2023-10-04T17:34:00Z) - Distributed Pruning Towards Tiny Neural Networks in Federated Learning [12.63559789381064]
FedTiny is a distributed pruning framework for federated learning.
It generates specialized tiny models for memory- and computing-constrained devices.
It achieves an accuracy improvement of 2.61% while significantly reducing the computational cost by 95.91%.
arXiv Detail & Related papers (2022-12-05T01:58:45Z) - HyperImpute: Generalized Iterative Imputation with Automatic Model
Selection [77.86861638371926]
We propose a generalized iterative imputation framework for adaptively and automatically configuring column-wise models.
We provide a concrete implementation with out-of-the-box learners, simulators, and interfaces.
arXiv Detail & Related papers (2022-06-15T19:10:35Z) - Adaptive Activation-based Structured Pruning [5.445935252764351]
Pruning is a promising approach to compress complex deep learning models in order to deploy them on resource-constrained edge devices.
This paper presents an adaptive, activation-based, structured pruning approach to automatically and efficiently generate small, accurate, and hardware-efficient models.
A comprehensive evaluation shows that the proposed method can substantially outperform the state-of-the-art structured pruning works.
arXiv Detail & Related papers (2022-01-21T22:21:31Z) - Semantic Perturbations with Normalizing Flows for Improved
Generalization [62.998818375912506]
We show that perturbations in the latent space can be used to define fully unsupervised data augmentations.
We find that our latent adversarial perturbations adaptive to the classifier throughout its training are most effective.
arXiv Detail & Related papers (2021-08-18T03:20:00Z) - Self Normalizing Flows [65.73510214694987]
We propose a flexible framework for training normalizing flows by replacing expensive terms in the gradient by learned approximate inverses at each layer.
This reduces the computational complexity of each layer's exact update from $mathcalO(D3)$ to $mathcalO(D2)$.
We show experimentally that such models are remarkably stable and optimize to similar data likelihood values as their exact gradient counterparts.
arXiv Detail & Related papers (2020-11-14T09:51:51Z) - SSMBA: Self-Supervised Manifold Based Data Augmentation for Improving
Out-of-Domain Robustness [66.37077266814822]
In natural language, it is difficult to generate new examples that stay on the underlying data manifold.
We introduce SSMBA, a data augmentation method for generating synthetic training examples.
In experiments on benchmarks across 3 tasks and 9 datasets, SSMBA consistently outperforms existing data augmentation methods.
arXiv Detail & Related papers (2020-09-21T22:02:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.