Efficient Pruning of Large Language Model with Adaptive Estimation Fusion
- URL: http://arxiv.org/abs/2403.10799v3
- Date: Wed, 15 May 2024 02:20:54 GMT
- Title: Efficient Pruning of Large Language Model with Adaptive Estimation Fusion
- Authors: Jun Liu, Chao Wu, Changdi Yang, Hao Tang, Zhenglun Kong, Geng Yuan, Wei Niu, Dong Huang, Yanzhi Wang
- Abstract summary: We introduce a simple yet efficient method that adaptively models the importance of each substructure.
It can adaptively fuse coarse-grained and fine-grained estimations based on the results from complex and multilayer structures.
Our experimental results, compared with state-of-the-art methods on mainstream datasets, demonstrate average accuracy improvements of 1.1%, 1.02%, 2.0%, and 1.2% for LLaMa-7B, Vicuna-7B, Baichuan-7B, and Bloom-7b1, respectively.
- Score: 45.423001839959156
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have become crucial for many generative downstream tasks, leading to an inevitable trend and significant challenge to deploy them efficiently on resource-constrained devices. Structured pruning is a widely used method to address this challenge. However, when dealing with the complex structure of the multiple decoder layers, general methods often employ common estimation approaches for pruning. These approaches lead to a decline in accuracy for specific downstream tasks. In this paper, we introduce a simple yet efficient method that adaptively models the importance of each substructure. Meanwhile, it can adaptively fuse coarse-grained and fine-grained estimations based on the results from complex and multilayer structures. All aspects of our design seamlessly integrate into the end-to-end pruning framework. Our experimental results, compared with state-of-the-art methods on mainstream datasets, demonstrate average accuracy improvements of 1.1%, 1.02%, 2.0%, and 1.2% for LLaMa-7B, Vicuna-7B, Baichuan-7B, and Bloom-7b1, respectively.
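The abstract stays at a high level, so the following is only a hedged sketch of what fusing a coarse-grained and a fine-grained importance estimate for one layer's output channels could look like. The L2-norm and first-order Taylor scores, the fusion weight `alpha`, and all shapes are illustrative assumptions, not the paper's actual estimators.

```python
import numpy as np

def coarse_importance(weight: np.ndarray) -> np.ndarray:
    # Coarse-grained estimate: one score per output channel (row),
    # here the L2 norm of the channel's weights (an assumption).
    return np.linalg.norm(weight, axis=1)

def fine_importance(weight: np.ndarray, grad: np.ndarray) -> np.ndarray:
    # Fine-grained estimate: first-order Taylor score |w * dL/dw|,
    # summed over each output channel (also an assumption).
    return np.abs(weight * grad).sum(axis=1)

def fused_importance(weight, grad, alpha):
    # Adaptive fusion: convex combination of the normalized estimates.
    # In the paper the fusion is driven by estimation results from the
    # multilayer structure; here alpha is simply a scalar in [0, 1].
    c = coarse_importance(weight)
    f = fine_importance(weight, grad)
    c = c / (c.sum() + 1e-12)
    f = f / (f.sum() + 1e-12)
    return alpha * c + (1.0 - alpha) * f

# Toy usage: keep the top-50% channels of a random "layer".
rng = np.random.default_rng(0)
W, G = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
scores = fused_importance(W, G, alpha=0.5)
keep = np.argsort(scores)[len(scores) // 2:]
print("channels kept:", sorted(keep.tolist()))
```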
Related papers
- Implicit Generative Prior for Bayesian Neural Networks [8.013264410621357]
We propose a novel neural adaptive empirical Bayes (NA-EB) framework for complex data structures.
The proposed NA-EB framework combines variational inference with a gradient ascent algorithm.
We demonstrate the practical applications of our framework through extensive evaluations on a variety of tasks.
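As a rough, hedged illustration of the NA-EB recipe (an implicit prior over model weights defined by a deep generator, fit jointly with a Gaussian variational posterior by gradient ascent on the ELBO), here is a toy PyTorch sketch on 1-D regression. The generator architecture, variational family, and noise scale are all assumptions, not the paper's implementation.

```python
import torch

# Toy data: 1-D linear regression.
x = torch.linspace(-1, 1, 50).unsqueeze(1)
y = 2.0 * x + 0.1 * torch.randn_like(x)

# Implicit prior: weights w = g(z), z ~ N(0, I), with g a small generator.
g = torch.nn.Sequential(torch.nn.Linear(4, 16), torch.nn.Tanh(), torch.nn.Linear(16, 2))

# Gaussian variational posterior q(z) = N(mu, diag(sigma^2)).
mu = torch.zeros(4, requires_grad=True)
log_sigma = torch.zeros(4, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma, *g.parameters()], lr=1e-2)

for step in range(500):
    eps = torch.randn(4)
    z = mu + log_sigma.exp() * eps   # reparameterized sample from q(z)
    w = g(z)                         # weights drawn via the implicit prior
    pred = w[0] * x + w[1]
    log_lik = -((pred - y) ** 2).sum() / (2 * 0.1 ** 2)
    # Closed-form KL(q(z) || N(0, I)) for a diagonal Gaussian.
    kl = 0.5 * (mu ** 2 + (2 * log_sigma).exp() - 2 * log_sigma - 1).sum()
    loss = -(log_lik - kl)           # gradient ascent on the ELBO
    opt.zero_grad(); loss.backward(); opt.step()

print("weights at posterior mean:", g(mu).detach())
```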
arXiv Detail & Related papers (2024-04-27T21:00:38Z)
- ECoFLaP: Efficient Coarse-to-Fine Layer-Wise Pruning for Vision-Language Models [70.45441031021291]
Large Vision-Language Models (LVLMs) can understand the world comprehensively by integrating rich information from different modalities.
LVLMs are often problematic to deploy due to their massive computational/energy costs and carbon footprint.
We propose Efficient Coarse-to-Fine Layer-Wise Pruning (ECoFLaP), a two-stage coarse-to-fine weight pruning approach for LVLMs.
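A minimal sketch of the two-stage coarse-to-fine idea under stated assumptions: stage one allocates a per-layer sparsity budget from a cheap global importance score, stage two prunes weights locally within each layer. The mean-magnitude layer score and inverse-importance allocation rule below are stand-ins, not ECoFLaP's actual global estimates.

```python
import numpy as np

def coarse_allocate(layers, global_sparsity):
    # Stage 1 (coarse): give less-important layers a higher sparsity budget.
    # Importance here is the mean |weight| of the layer (an assumption).
    scores = np.array([np.abs(w).mean() for w in layers])
    inv = 1.0 / (scores + 1e-12)
    ratios = global_sparsity * inv * len(layers) / inv.sum()
    return np.clip(ratios, 0.0, 0.95)   # crude cap to avoid emptying a layer

def fine_prune(weight, sparsity):
    # Stage 2 (fine): magnitude-prune within the layer to its budget.
    k = int(weight.size * sparsity)
    if k == 0:
        return weight
    thresh = np.partition(np.abs(weight).ravel(), k - 1)[k - 1]
    return np.where(np.abs(weight) <= thresh, 0.0, weight)

rng = np.random.default_rng(0)
layers = [rng.normal(scale=s, size=(32, 32)) for s in (0.5, 1.0, 2.0)]
for w, r in zip(layers, coarse_allocate(layers, global_sparsity=0.5)):
    pruned = fine_prune(w, r)
    print(f"layer sparsity target {r:.2f}, achieved {(pruned == 0).mean():.2f}")
```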
arXiv Detail & Related papers (2023-10-04T17:34:00Z)
- Distributed Pruning Towards Tiny Neural Networks in Federated Learning [12.63559789381064]
FedTiny is a distributed pruning framework for federated learning.
It generates specialized tiny models for memory- and computing-constrained devices.
It achieves an accuracy improvement of 2.61% while reducing the computational cost by 95.91%.
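A loose sketch of the distributed-pruning pattern, assuming a simple design in which each client scores output channels on private data and the server averages the scores into one shared mask; FedTiny's actual protocol (with progressive pruning and adaptive batch-norm adaptation) is considerably richer.

```python
import numpy as np

def client_scores(weight, local_data):
    # Each client scores output channels using local activations
    # (a simple stand-in for the local importance estimation).
    acts = np.maximum(local_data @ weight.T, 0.0)   # ReLU activations
    return acts.mean(axis=0)

def server_prune(all_scores, keep_ratio):
    # The server averages client scores and builds one shared channel mask.
    mean = np.mean(all_scores, axis=0)
    k = int(len(mean) * keep_ratio)
    keep = np.argsort(mean)[-k:]
    mask = np.zeros_like(mean, dtype=bool)
    mask[keep] = True
    return mask

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 8))                     # 16 output channels
clients = [rng.normal(size=(32, 8)) for _ in range(5)]
scores = [client_scores(W, d) for d in clients]
mask = server_prune(scores, keep_ratio=0.5)
print("channels kept:", np.flatnonzero(mask).tolist())
```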
arXiv Detail & Related papers (2022-12-05T01:58:45Z)
- HyperImpute: Generalized Iterative Imputation with Automatic Model Selection [77.86861638371926]
We propose a generalized iterative imputation framework for adaptively and automatically configuring column-wise models.
We provide a concrete implementation with out-of-the-box learners, simulators, and interfaces.
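A bare-bones sketch of iterative imputation with automatic column-wise model selection, in the spirit of the framework above. The two candidate learners and the cross-validation selection rule are assumptions; HyperImpute ships a much larger learner pool plus simulators and interfaces.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def iterative_impute(X, n_iters=3):
    # Start from column-mean imputation, then refine column by column.
    X = X.copy()
    missing = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    for j in range(X.shape[1]):
        X[missing[:, j], j] = col_means[j]
    candidates = [LinearRegression(), RandomForestRegressor(n_estimators=20)]
    for _ in range(n_iters):
        for j in range(X.shape[1]):
            if not missing[:, j].any():
                continue
            obs = ~missing[:, j]
            features = np.delete(X, j, axis=1)
            # Automatic model selection: pick the candidate with the best
            # cross-validated score on the column's observed rows.
            best = max(candidates, key=lambda m: cross_val_score(
                m, features[obs], X[obs, j], cv=3).mean())
            best.fit(features[obs], X[obs, j])
            X[missing[:, j], j] = best.predict(features[missing[:, j]])
    return X

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 4)); data[:, 3] += data[:, 0]
data[rng.random(data.shape) < 0.1] = np.nan     # 10% missing at random
print(iterative_impute(data)[:3])
```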
arXiv Detail & Related papers (2022-06-15T19:10:35Z)
- Adaptive Activation-based Structured Pruning [5.445935252764351]
Pruning is a promising approach to compress complex deep learning models in order to deploy them on resource-constrained edge devices.
This paper presents an adaptive, activation-based, structured pruning approach to automatically and efficiently generate small, accurate, and hardware-efficient models.
A comprehensive evaluation shows that the proposed method can substantially outperform the state-of-the-art structured pruning works.
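The activation-based criterion can be sketched in a few lines: run calibration data through a layer, record how strongly each output channel fires, and drop the weakest channels. The mean-absolute-activation statistic and fixed prune ratio below are assumptions; the paper adapts these automatically during training.

```python
import numpy as np

def prune_by_activation(weight, bias, calib_inputs, prune_ratio):
    # Record per-channel mean |activation| on calibration data.
    acts = np.maximum(calib_inputs @ weight.T + bias, 0.0)   # ReLU layer
    importance = np.abs(acts).mean(axis=0)
    # Structured pruning: physically drop the weakest output channels.
    n_keep = int(len(importance) * (1.0 - prune_ratio))
    keep = np.sort(np.argsort(importance)[-n_keep:])
    return weight[keep], bias[keep], keep

rng = np.random.default_rng(0)
W, b = rng.normal(size=(64, 128)), rng.normal(size=64)
calib = rng.normal(size=(256, 128))
W2, b2, kept = prune_by_activation(W, b, calib, prune_ratio=0.5)
print(W.shape, "->", W2.shape)   # (64, 128) -> (32, 128)
```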
arXiv Detail & Related papers (2022-01-21T22:21:31Z)
- Semantic Perturbations with Normalizing Flows for Improved Generalization [62.998818375912506]
We show that perturbations in the latent space can be used to define fully unsupervised data augmentations.
We find that latent adversarial perturbations that adapt to the classifier throughout its training are the most effective.
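A compressed sketch of the implied augmentation loop: encode an input into the latent space of an invertible model, take an adversarial gradient step against the current classifier there, and decode the perturbed latent as a new training example. The toy affine "flow" below is a stand-in; the paper uses trained normalizing flows.

```python
import torch

# Toy invertible "flow": an affine map z = (x - b) / a, x = a * z + b.
a, b = torch.tensor(2.0), torch.tensor(0.5)
encode = lambda x: (x - b) / a
decode = lambda z: a * z + b

classifier = torch.nn.Linear(4, 2)
loss_fn = torch.nn.CrossEntropyLoss()

def semantic_perturb(x, label, eps=0.1):
    # Perturb in latent space, adversarially w.r.t. the current classifier.
    z = encode(x).detach().requires_grad_(True)
    loss = loss_fn(classifier(decode(z)).unsqueeze(0), label.unsqueeze(0))
    loss.backward()
    z_adv = z + eps * z.grad.sign()      # FGSM-style step in z-space
    return decode(z_adv).detach()        # decoded augmented example

x, y = torch.randn(4), torch.tensor(1)
print(x, "->", semantic_perturb(x, y))
```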
arXiv Detail & Related papers (2021-08-18T03:20:00Z)
- Self Normalizing Flows [65.73510214694987]
We propose a flexible framework for training normalizing flows by replacing expensive terms in the gradient by learned approximate inverses at each layer.
This reduces the computational complexity of each layer's exact update from $\mathcal{O}(D^3)$ to $\mathcal{O}(D^2)$.
We show experimentally that such models are remarkably stable and optimize to similar data likelihood values as their exact gradient counterparts.
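The complexity claim comes from the log-determinant term: the exact gradient of $\log|\det W|$ with respect to a $D \times D$ weight matrix $W$ is $(W^{-1})^\top$, which needs an $\mathcal{O}(D^3)$ inverse, while a self-normalizing layer maintains a learned approximate inverse $R \approx W^{-1}$ and substitutes $R^\top$. The sketch below demonstrates only the substitution; the penalty keeping $R$ close to $W^{-1}$ is noted as an assumption in the comments.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4
W = rng.normal(size=(D, D))                    # layer weight
# Learned approximate inverse; during training it would be kept close to
# W^{-1} by a reconstruction penalty such as ||W R - I||^2 (an assumption).
R = np.linalg.inv(W) + 0.01 * rng.normal(size=(D, D))

exact = np.linalg.inv(W).T   # d(log|det W|)/dW, needs an O(D^3) inverse
approx = R.T                 # self-normalizing substitute: just a transpose

print("relative error:", np.linalg.norm(exact - approx) / np.linalg.norm(exact))
```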
arXiv Detail & Related papers (2020-11-14T09:51:51Z)
- SSMBA: Self-Supervised Manifold Based Data Augmentation for Improving Out-of-Domain Robustness [66.37077266814822]
In natural language, it is difficult to generate new examples that stay on the underlying data manifold.
We introduce SSMBA, a data augmentation method for generating synthetic training examples.
In experiments on benchmarks across 3 tasks and 9 datasets, SSMBA consistently outperforms existing data augmentation methods.
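SSMBA's corrupt-and-reconstruct loop can be sketched with any masked language model as the reconstruction function: mask tokens to move a sentence off the data manifold, then let the MLM project it back. The Hugging Face fill-mask pipeline and single-token masking below are simplifying assumptions; SSMBA masks a fraction of tokens and samples from the reconstruction model.

```python
import random
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
MASK = unmasker.tokenizer.mask_token          # "[MASK]" for BERT

def ssmba_augment(sentence, seed=0):
    # Corrupt: mask one randomly chosen token (a single mask keeps this
    # sketch simple and the pipeline's output format unambiguous).
    random.seed(seed)
    tokens = sentence.split()
    tokens[random.randrange(len(tokens))] = MASK
    # Reconstruct: take the MLM's top completion as the synthetic example;
    # sampling from the top-k candidates would give more diverse outputs.
    candidates = unmasker(" ".join(tokens))
    return candidates[0]["sequence"]

print(ssmba_augment("the quick brown fox jumps over the lazy dog"))
```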
arXiv Detail & Related papers (2020-09-21T22:02:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.