Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes
- URL: http://arxiv.org/abs/2402.05406v2
- Date: Fri, 9 Feb 2024 19:53:56 GMT
- Title: Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes
- Authors: Lucio Dery, Steven Kolawole, Jean-François Kagy, Virginia Smith,
Graham Neubig, Ameet Talwalkar
- Abstract summary: We develop a gradient-free, perturbative pruning method capable of delivering small, fast, and accurate pruned models.
We also leverage Bonsai to produce a new sub-2B model using a single A6000 that yields state-of-the-art performance on 4/6 tasks on the Huggingface Open LLM leaderboard.
- Score: 72.09861461921663
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given the generational gap in available hardware between lay practitioners
and the most endowed institutions, LLMs are becoming increasingly inaccessible
as they grow in size. Whilst many approaches have been proposed to compress
LLMs to make their resource consumption manageable, these methods themselves
tend to be resource intensive, putting them out of the reach of the very user
groups they target. In this work, we explore the problem of structured pruning
of LLMs using only forward passes. We seek to empower practitioners to prune
models so large that their available hardware has just enough memory to run
inference. We develop Bonsai, a gradient-free, perturbative pruning method
capable of delivering small, fast, and accurate pruned models.
We observe that Bonsai outputs pruned models that (i) outperform those
generated by more expensive gradient-based structured pruning methods, and (ii)
are twice as fast (with comparable accuracy) as those generated by
semi-structured pruning methods that require resources comparable to Bonsai's. We
also leverage Bonsai to produce a new sub-2B model using a single A6000 that
yields state-of-the-art performance on 4/6 tasks on the Huggingface Open LLM
leaderboard.
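The abstract does not spell out the algorithm, but the core idea of gradient-free, perturbative pruning can be sketched: temporarily remove candidate sub-structures (attention heads, MLP channels, layers), measure the resulting change in loss using inference only, and drop the structures whose removal hurts least. The sketch below is a hedged illustration of that idea under assumed interfaces (`candidate_modules` and `eval_loss` are hypothetical stand-ins), not Bonsai's actual implementation.

```python
import torch

@torch.no_grad()  # forward passes only: no gradients or optimizer state are needed
def perturbative_importance(model, candidate_modules, eval_loss, calib_batches):
    """Score each prunable module by how much the loss degrades when it is masked.

    candidate_modules: dict name -> callable that masks that module in place and
                       returns an 'undo' callable (hypothetical interface).
    eval_loss:         callable(model, batch) -> scalar loss (forward pass only).
    """
    base = sum(float(eval_loss(model, b)) for b in calib_batches) / len(calib_batches)
    scores = {}
    for name, mask_module in candidate_modules.items():
        undo = mask_module()                       # perturb: temporarily drop this module
        perturbed = sum(float(eval_loss(model, b)) for b in calib_batches) / len(calib_batches)
        undo()                                     # restore the module
        scores[name] = perturbed - base            # larger loss increase => more important
    return scores

def modules_to_keep(scores, keep_ratio=0.5):
    """Keep the highest-scoring (most useful) fraction of modules."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[: max(1, int(len(ranked) * keep_ratio))])
```

In practice such a method must also budget how many forward passes it spends per candidate; the small calibration set above is the only data it needs, which is what makes the approach feasible on inference-only hardware.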
Related papers
- Reassessing Layer Pruning in LLMs: New Insights and Methods [24.394438652261982]
We show that a simple approach, i.e., pruning the final 25% of layers followed by fine-tuning the lm_head and the last three remaining layers, yields remarkably strong performance.
We release the optimal model weights on Hugging Face, and the code is available on GitHub.
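A minimal sketch of this recipe, assuming a LLaMA-style causal LM from transformers where the decoder blocks live in model.model.layers (as in LlamaForCausalLM); the checkpoint name is only an example, and the fine-tuning loop itself is omitted.

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # example checkpoint

# Prune the final 25% of decoder blocks.
blocks = model.model.layers
n_keep = int(len(blocks) * 0.75)
model.model.layers = nn.ModuleList(list(blocks)[:n_keep])
model.config.num_hidden_layers = n_keep

# Fine-tune only lm_head and the last three remaining blocks; freeze everything else.
for p in model.parameters():
    p.requires_grad = False
for p in model.lm_head.parameters():
    p.requires_grad = True
for block in model.model.layers[-3:]:
    for p in block.parameters():
        p.requires_grad = True
# ...then run a standard fine-tuning loop (e.g., transformers.Trainer) on the trainable parameters.
```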
arXiv Detail & Related papers (2024-11-23T13:31:16Z)
- Pruning Foundation Models for High Accuracy without Retraining [48.256389781305415]
It is challenging to deploy foundation models or large language models (LLMs) because of their massive parameter counts and compute requirements.
Post-training pruning methods have been proposed to prune LLMs in one shot, without retraining.
Our experiments demonstrate the superior performance of the proposed methods in comparison to SOTA baselines.
arXiv Detail & Related papers (2024-10-21T01:23:34Z)
- AlphaPruning: Using Heavy-Tailed Self Regularization Theory for Improved Layer-wise Pruning of Large Language Models [94.82766517752418]
We propose AlphaPruning, which uses shape metrics to allocate layerwise sparsity ratios in a more theoretically principled manner.
Our results show that AlphaPruning prunes LLaMA-7B to 80% sparsity while maintaining reasonable perplexity, marking a first in the literature on LLMs.
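The shape metrics in question come from the heavy-tailed self-regularization view: each layer's weight matrix is scored by the tail shape of its eigenvalue spectrum (e.g., a fitted power-law exponent alpha), and a global sparsity budget is then spread unevenly across layers. The sketch below is a loose illustration only; it uses a crude Hill-style fit for alpha and assumes that heavier-tailed layers (smaller alpha) should be pruned less, which may differ from the paper's exact estimator and mapping.

```python
import torch

def powerlaw_alpha(weight: torch.Tensor, tail_frac: float = 0.1) -> float:
    """Crude Hill-style estimate of the power-law exponent of the weight ESD.
    (The paper uses a more careful fit; this is only an illustration.)"""
    eig = torch.linalg.svdvals(weight.detach().float()) ** 2   # eigenvalues of W^T W
    eig, _ = torch.sort(eig, descending=True)
    k = max(2, int(tail_frac * eig.numel()))                   # fit on the top-k tail
    tail = eig[:k]
    log_sum = torch.log(tail / tail[-1]).sum().clamp_min(1e-8).item()
    return 1.0 + k / log_sum

def allocate_sparsity(layer_weights, target_sparsity=0.7, spread=0.2):
    """Map per-layer alpha to a per-layer sparsity ratio around the global target:
    layers with alpha below the mean (heavier tails) are pruned less."""
    alphas = {name: powerlaw_alpha(w) for name, w in layer_weights.items()}
    mean_alpha = sum(alphas.values()) / len(alphas)
    return {name: min(0.95, max(0.0, target_sparsity + spread * (a - mean_alpha)))
            for name, a in alphas.items()}
```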
arXiv Detail & Related papers (2024-10-14T03:35:11Z)
- Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning method for large language models.
We learn the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model.
Our method runs in about 2.7 hours and uses around 35 GB of memory when pruning 13B models on a single A100 GPU.
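Learning masks "in a probabilistic space" with a policy gradient can be sketched without ever backpropagating through the model: keep a Bernoulli keep-probability per prunable unit, sample binary masks, score each sample with a forward-pass loss plus a sparsity penalty, and update the probabilities with a REINFORCE-style estimator. The toy loop below illustrates that pattern; eval_loss_with_mask is an assumed hook, and the paper's parameterization and regularizer may differ.

```python
import torch

def learn_masks(n_units, eval_loss_with_mask, target_keep=0.5,
                steps=200, samples=8, lr=0.05, sparsity_weight=1.0):
    """REINFORCE-style search over binary pruning masks (the LLM sees forward passes only).

    eval_loss_with_mask: callable(mask: BoolTensor[n_units]) -> scalar loss of the masked model.
    """
    logits = torch.zeros(n_units, requires_grad=True)          # keep-probabilities via sigmoid
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        dist = torch.distributions.Bernoulli(probs=torch.sigmoid(logits))
        masks = dist.sample((samples,))                        # (samples, n_units) in {0, 1}
        rewards = []
        for m in masks:
            with torch.no_grad():                              # the model itself is never differentiated
                loss = eval_loss_with_mask(m.bool())
            penalty = sparsity_weight * (m.mean() - target_keep).abs()
            rewards.append(-(loss + penalty))                  # higher reward = better mask
        rewards = torch.stack(rewards)
        log_probs = dist.log_prob(masks).sum(dim=-1)           # gradient flows only into the logits
        pg_loss = -((rewards - rewards.mean()).detach() * log_probs).mean()
        opt.zero_grad()
        pg_loss.backward()
        opt.step()
    return torch.sigmoid(logits) > 0.5                         # final hard keep/prune decision
```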
arXiv Detail & Related papers (2024-06-15T09:31:03Z)
- Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning [52.29522018586365]
We study structured pruning as an effective means to develop smaller LLMs from pre-trained, larger models.
Our approach employs two key techniques: (1) targeted structured pruning, which prunes a larger model to a specified target shape by removing layers, heads, and intermediate and hidden dimensions in an end-to-end manner, and (2) dynamic batch loading, which dynamically updates the composition of sampled data in each training batch based on varying losses across different domains.
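Dynamic batch loading, as described in (2), can be sketched as a reweighting loop: track each domain's current loss against a reference loss, upweight the domains that are lagging, and compose the next batch from the updated weights. A multiplicative-weights update is one plausible choice shown below; the paper's exact rule, reference losses, and data plumbing are assumptions here.

```python
import math
import random

def update_domain_weights(weights, current_losses, reference_losses, step_size=1.0):
    """Upweight domains whose current loss is still far above its reference loss."""
    gaps = {d: max(0.0, current_losses[d] - reference_losses[d]) for d in weights}
    boosted = {d: weights[d] * math.exp(step_size * gaps[d]) for d in weights}
    total = sum(boosted.values())
    return {d: w / total for d, w in boosted.items()}          # renormalize to a distribution

def compose_batch(domain_examples, weights, batch_size):
    """Sample the next training batch according to the current domain weights."""
    domains = list(weights)
    picks = random.choices(domains, weights=[weights[d] for d in domains], k=batch_size)
    return [random.choice(domain_examples[d]) for d in picks]
```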
arXiv Detail & Related papers (2023-10-10T15:13:30Z)
- LoRAPrune: Structured Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning [56.88751562302793]
Low-rank adaptation (LoRA) has emerged as a parameter-efficient way to fine-tune large language models (LLMs).
LoRAPrune is a framework that delivers an accurate, structurally pruned model in a highly memory-efficient manner.
LoRAPrune achieves a reduction in perplexity by 4.81 on WikiText2 and 3.46 on PTB, while also decreasing memory usage by 52.6%.
arXiv Detail & Related papers (2023-05-28T15:15:48Z)