Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes
- URL: http://arxiv.org/abs/2402.05406v2
- Date: Fri, 9 Feb 2024 19:53:56 GMT
- Title: Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes
- Authors: Lucio Dery, Steven Kolawole, Jean-François Kagy, Virginia Smith,
Graham Neubig, Ameet Talwalkar
- Abstract summary: We develop a gradient-free, perturbative pruning method capable of delivering small, fast, and accurate pruned models.
We also leverage Bonsai to produce a new sub-2B model using a single A6000 that yields state-of-the-art performance on 4/6 tasks on the Huggingface Open LLM leaderboard.
- Score: 72.09861461921663
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given the generational gap in available hardware between lay practitioners
and the most endowed institutions, LLMs are becoming increasingly inaccessible
as they grow in size. Whilst many approaches have been proposed to compress
LLMs to make their resource consumption manageable, these methods themselves
tend to be resource intensive, putting them out of the reach of the very user
groups they target. In this work, we explore the problem of structured pruning
of LLMs using only forward passes. We seek to empower practitioners to prune
models so large that their available hardware has just enough memory to run
inference. We develop Bonsai, a gradient-free, perturbative pruning method
capable of delivering small, fast, and accurate pruned models.
We observe that Bonsai outputs pruned models that (i) outperform those
generated by more expensive gradient-based structured pruning methods, and (ii)
are twice as fast (with comparable accuracy) as those generated by
semi-structured pruning methods that require resources comparable to Bonsai's. We
also leverage Bonsai to produce a new sub-2B model using a single A6000 that
yields state-of-the-art performance on 4/6 tasks on the Huggingface Open LLM
leaderboard.
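The abstract does not spell out the algorithm, but the core idea of gradient-free, perturbative pruning can be sketched: temporarily remove candidate sub-structures (attention heads, MLP channels, layers), measure the resulting change in loss using inference only, and drop the structures whose removal hurts least. The sketch below is a hedged illustration of that idea under assumed interfaces (`candidate_modules` and `eval_loss` are hypothetical stand-ins), not Bonsai's actual implementation.

```python
import torch

@torch.no_grad()  # forward passes only: no gradients or optimizer state are needed
def perturbative_importance(model, candidate_modules, eval_loss, calib_batches):
    """Score each prunable module by how much the loss degrades when it is masked.

    candidate_modules: dict name -> callable that masks that module in place and
                       returns an 'undo' callable (hypothetical interface).
    eval_loss:         callable(model, batch) -> scalar loss (forward pass only).
    """
    base = sum(float(eval_loss(model, b)) for b in calib_batches) / len(calib_batches)
    scores = {}
    for name, mask_module in candidate_modules.items():
        undo = mask_module()                       # perturb: temporarily drop this module
        perturbed = sum(float(eval_loss(model, b)) for b in calib_batches) / len(calib_batches)
        undo()                                     # restore the module
        scores[name] = perturbed - base            # larger loss increase => more important
    return scores

def modules_to_keep(scores, keep_ratio=0.5):
    """Keep the highest-scoring (most useful) fraction of modules."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[: max(1, int(len(ranked) * keep_ratio))])
```

In practice such a method must also budget how many forward passes it spends per candidate; the small calibration set above is the only data it needs, which is what makes the approach feasible on inference-only hardware.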
Related papers
- Reassessing Layer Pruning in LLMs: New Insights and Methods [24.394438652261982]
We show that a simple approach, i.e., pruning the final 25% of layers followed by fine-tuning the lm_head and the last three remaining layers, yields remarkably strong performance.
We release the optimal model weights on Hugging Face, and the code is available on GitHub.
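A minimal sketch of this recipe, assuming a LLaMA-style causal LM from transformers where the decoder blocks live in model.model.layers (as in LlamaForCausalLM); the checkpoint name is only an example, and the fine-tuning loop itself is omitted.

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # example checkpoint

# Prune the final 25% of decoder blocks.
blocks = model.model.layers
n_keep = int(len(blocks) * 0.75)
model.model.layers = nn.ModuleList(list(blocks)[:n_keep])
model.config.num_hidden_layers = n_keep

# Fine-tune only lm_head and the last three remaining blocks; freeze everything else.
for p in model.parameters():
    p.requires_grad = False
for p in model.lm_head.parameters():
    p.requires_grad = True
for block in model.model.layers[-3:]:
    for p in block.parameters():
        p.requires_grad = True
# ...then run a standard fine-tuning loop (e.g., transformers.Trainer) on the trainable parameters.
```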
arXiv Detail & Related papers (2024-11-23T13:31:16Z)
- Pruning Foundation Models for High Accuracy without Retraining [48.256389781305415]
It is challenging to deploy foundation models or large language models (LLMs) because of their massive parameter counts and compute requirements.
Post-training pruning methods have been proposed to prune LLMs in one shot, without retraining.
Our experiments demonstrate the superior performance of the proposed methods in comparison to SOTA baselines.
arXiv Detail & Related papers (2024-10-21T01:23:34Z)
- AlphaPruning: Using Heavy-Tailed Self Regularization Theory for Improved Layer-wise Pruning of Large Language Models [94.82766517752418]
We propose AlphaPruning, which uses shape metrics to allocate layerwise sparsity ratios in a more theoretically principled manner.
Our results show that AlphaPruning prunes LLaMA-7B to 80% sparsity while maintaining reasonable perplexity, marking a first in the literature on LLMs.
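The shape metrics in question come from the heavy-tailed self-regularization view: each layer's weight matrix is scored by the tail shape of its eigenvalue spectrum (e.g., a fitted power-law exponent alpha), and a global sparsity budget is then spread unevenly across layers. The sketch below is a loose illustration only; it uses a crude Hill-style fit for alpha and assumes that heavier-tailed layers (smaller alpha) should be pruned less, which may differ from the paper's exact estimator and mapping.

```python
import torch

def powerlaw_alpha(weight: torch.Tensor, tail_frac: float = 0.1) -> float:
    """Crude Hill-style estimate of the power-law exponent of the weight ESD.
    (The paper uses a more careful fit; this is only an illustration.)"""
    eig = torch.linalg.svdvals(weight.detach().float()) ** 2   # eigenvalues of W^T W
    eig, _ = torch.sort(eig, descending=True)
    k = max(2, int(tail_frac * eig.numel()))                   # fit on the top-k tail
    tail = eig[:k]
    log_sum = torch.log(tail / tail[-1]).sum().clamp_min(1e-8).item()
    return 1.0 + k / log_sum

def allocate_sparsity(layer_weights, target_sparsity=0.7, spread=0.2):
    """Map per-layer alpha to a per-layer sparsity ratio around the global target:
    layers with alpha below the mean (heavier tails) are pruned less."""
    alphas = {name: powerlaw_alpha(w) for name, w in layer_weights.items()}
    mean_alpha = sum(alphas.values()) / len(alphas)
    return {name: min(0.95, max(0.0, target_sparsity + spread * (a - mean_alpha)))
            for name, a in alphas.items()}
```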
arXiv Detail & Related papers (2024-10-14T03:35:11Z)
- Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning method for large language models.
We learn the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model.
Our method runs in about 2.7 hours and uses around 35 GB of memory when pruning 13B models on a single A100 GPU.
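Learning masks "in a probabilistic space" with a policy gradient can be sketched without ever backpropagating through the model: keep a Bernoulli keep-probability per prunable unit, sample binary masks, score each sample with a forward-pass loss plus a sparsity penalty, and update the probabilities with a REINFORCE-style estimator. The toy loop below illustrates that pattern; eval_loss_with_mask is an assumed hook, and the paper's parameterization and regularizer may differ.

```python
import torch

def learn_masks(n_units, eval_loss_with_mask, target_keep=0.5,
                steps=200, samples=8, lr=0.05, sparsity_weight=1.0):
    """REINFORCE-style search over binary pruning masks (the LLM sees forward passes only).

    eval_loss_with_mask: callable(mask: BoolTensor[n_units]) -> scalar loss of the masked model.
    """
    logits = torch.zeros(n_units, requires_grad=True)          # keep-probabilities via sigmoid
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        dist = torch.distributions.Bernoulli(probs=torch.sigmoid(logits))
        masks = dist.sample((samples,))                        # (samples, n_units) in {0, 1}
        rewards = []
        for m in masks:
            with torch.no_grad():                              # the model itself is never differentiated
                loss = eval_loss_with_mask(m.bool())
            penalty = sparsity_weight * (m.mean() - target_keep).abs()
            rewards.append(-(loss + penalty))                  # higher reward = better mask
        rewards = torch.stack(rewards)
        log_probs = dist.log_prob(masks).sum(dim=-1)           # gradient flows only into the logits
        pg_loss = -((rewards - rewards.mean()).detach() * log_probs).mean()
        opt.zero_grad()
        pg_loss.backward()
        opt.step()
    return torch.sigmoid(logits) > 0.5                         # final hard keep/prune decision
```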
arXiv Detail & Related papers (2024-06-15T09:31:03Z)
- Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning [52.29522018586365]
We study structured pruning as an effective means to develop smaller LLMs from pre-trained, larger models.
Our approach employs two key techniques: (1) targeted structured pruning, which prunes a larger model to a specified target shape by removing layers, heads, and intermediate and hidden dimensions in an end-to-end manner, and (2) dynamic batch loading, which dynamically updates the composition of sampled data in each training batch based on varying losses across different domains.
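Dynamic batch loading, as described in (2), can be sketched as a reweighting loop: track each domain's current loss against a reference loss, upweight the domains that are lagging, and compose the next batch from the updated weights. A multiplicative-weights update is one plausible choice shown below; the paper's exact rule, reference losses, and data plumbing are assumptions here.

```python
import math
import random

def update_domain_weights(weights, current_losses, reference_losses, step_size=1.0):
    """Upweight domains whose current loss is still far above its reference loss."""
    gaps = {d: max(0.0, current_losses[d] - reference_losses[d]) for d in weights}
    boosted = {d: weights[d] * math.exp(step_size * gaps[d]) for d in weights}
    total = sum(boosted.values())
    return {d: w / total for d, w in boosted.items()}          # renormalize to a distribution

def compose_batch(domain_examples, weights, batch_size):
    """Sample the next training batch according to the current domain weights."""
    domains = list(weights)
    picks = random.choices(domains, weights=[weights[d] for d in domains], k=batch_size)
    return [random.choice(domain_examples[d]) for d in picks]
```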
arXiv Detail & Related papers (2023-10-10T15:13:30Z)
- LoRAPrune: Structured Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning [56.88751562302793]
Low-rank adaptation (LoRA) has emerged as a parameter-efficient way to fine-tune large language models (LLMs).
LoRAPrune is a framework that delivers an accurate, structurally pruned model in a highly memory-efficient manner.
LoRAPrune achieves a reduction in perplexity by 4.81 on WikiText2 and 3.46 on PTB, while also decreasing memory usage by 52.6%.
arXiv Detail & Related papers (2023-05-28T15:15:48Z)