Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes
- URL: http://arxiv.org/abs/2402.05406v2
- Date: Fri, 9 Feb 2024 19:53:56 GMT
- Title: Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes
- Authors: Lucio Dery, Steven Kolawole, Jean-François Kagy, Virginia Smith,
Graham Neubig, Ameet Talwalkar
- Abstract summary: We develop a gradient-free, perturbative pruning method capable of delivering small, fast, and accurate pruned models.
We also leverage Bonsai to produce a new sub-2B model using a single A6000 that yields state-of-the-art performance on 4/6 tasks on the Huggingface Open LLM leaderboard.
- Score: 72.09861461921663
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given the generational gap in available hardware between lay practitioners
and the most endowed institutions, LLMs are becoming increasingly inaccessible
as they grow in size. Whilst many approaches have been proposed to compress
LLMs to make their resource consumption manageable, these methods themselves
tend to be resource intensive, putting them out of the reach of the very user
groups they target. In this work, we explore the problem of structured pruning
of LLMs using only forward passes. We seek to empower practitioners to prune
models so large that their available hardware has just enough memory to run
inference. We develop Bonsai, a gradient-free, perturbative pruning method
capable of delivering small, fast, and accurate pruned models.
We observe that Bonsai outputs pruned models that (i) outperform those
generated by more expensive gradient-based structured pruning methods, and (ii)
are twice as fast (with comparable accuracy) as those generated by
semi-structured pruning methods requiring resources comparable to Bonsai's. We
also leverage Bonsai to produce a new sub-2B model using a single A6000 that
yields state-of-the-art performance on 4/6 tasks on the Huggingface Open LLM
leaderboard.
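As an illustration of what "gradient-free, perturbative" pruning with only forward passes could look like, here is a minimal sketch. It is not the authors' implementation: the functions `estimate_importance`, `prune`, and `toy_forward_loss`, the regression-based scoring, and the six-module toy model are all assumptions made for this example. The idea shown is to sample random sub-models, score each one with a forward-only loss evaluation, and regress the observed losses on the keep/remove masks to rank modules.

```python
import numpy as np


def estimate_importance(forward_loss, n_modules, n_samples=200,
                        keep_prob=0.5, ridge=1e-3, seed=0):
    """Estimate per-module importance from forward evaluations only.

    forward_loss is a hypothetical callable that takes a 0/1 mask
    (1 = module kept, 0 = module removed) and returns a scalar loss.
    No gradients are ever computed.
    """
    rng = np.random.default_rng(seed)
    # Each row is one randomly perturbed sub-model.
    masks = (rng.random((n_samples, n_modules)) < keep_prob).astype(float)
    losses = np.array([forward_loss(m) for m in masks])

    # Ridge regression of loss on mask, with an intercept column.
    X = np.hstack([masks, np.ones((n_samples, 1))])
    A = X.T @ X + ridge * np.eye(n_modules + 1)
    beta = np.linalg.solve(A, X.T @ losses)

    # Keeping an important module lowers the loss, so its coefficient
    # is negative; flip the sign to get "higher = more important".
    return -beta[:-1]


def prune(importance, n_keep):
    """Return a boolean mask that keeps the n_keep highest-scoring modules."""
    keep = np.zeros(importance.shape[0], dtype=bool)
    keep[np.argsort(importance)[-n_keep:]] = True
    return keep


# Toy stand-in for a model (an assumption, not an LLM): removing module i
# adds true_weight[i] to the loss.
true_weight = np.array([5.0, 0.1, 3.0, 0.2, 4.0, 0.05])


def toy_forward_loss(mask):
    return float(np.sum(true_weight * (1.0 - mask)))


importance = estimate_importance(toy_forward_loss, n_modules=6)
keep_mask = prune(importance, n_keep=3)
print(keep_mask)
```

In this toy setting the loss is exactly linear in the mask, so the regression recovers the per-module weights almost perfectly. With a real model, `forward_loss` would run inference on a calibration batch with the masked modules disabled, and the estimates would be noisier.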
Related papers
- Compact Language Models via Pruning and Knowledge Distillation [61.56557874432008]
Minitron models exhibit up to a 16% improvement in MMLU scores compared to training from scratch.
Deriving 8B and 4B models from an already pretrained 15B model using our approach requires up to 40x fewer training tokens per model compared to training from scratch.
arXiv Detail & Related papers (2024-07-19T21:47:57Z)
- Optimization-based Structural Pruning for Large Language Models without Back-Propagation [57.9629676017527]
We propose an optimization-based structural pruning method for Large Language Models (LLMs).
Our method learns the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model.
Our method runs for 2.7 hours with around 35GB of memory for the 13B models on a single A100 GPU, and our pruned models outperform the state of the art in perplexity.
arXiv Detail & Related papers (2024-06-15T09:31:03Z)
- SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models [53.638791265113625]
SPP is a Sparsity-Preserved Parameter-efficient fine-tuning method for large language models.
Code will be made available at https://github.com/Lucky-Lance/SPP.
arXiv Detail & Related papers (2024-05-25T04:55:27Z)
- Shortened LLaMA: Depth Pruning for Large Language Models with Comparison of Retraining Methods [5.135352292810664]
We show that simple depth pruning can effectively compress large language models (LLMs).
Our pruning method boosts inference speeds, especially under memory-constrained conditions.
We hope this work can help build compact yet capable LLMs.
arXiv Detail & Related papers (2024-02-05T09:44:49Z)
- Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning [52.29522018586365]
We study structured pruning as an effective means to develop smaller LLMs from pre-trained, larger models.
Our approach employs two key techniques: (1) targeted structured pruning, which prunes a larger model to a specified target shape by removing layers, heads, and intermediate and hidden dimensions in an end-to-end manner, and (2) dynamic batch loading, which dynamically updates the composition of sampled data in each training batch based on varying losses across different domains.
arXiv Detail & Related papers (2023-10-10T15:13:30Z)
- Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models [15.471290825100075]
We introduce a new paradigm for structurally pruning Large Language Models, called Compresso.
Our approach, through the collaboration of the proposed resource-efficient pruning algorithm and the LLM itself, learns optimal pruning decisions during the training process.
In experiments, Compresso significantly outperforms one-shot pruning baselines across various sparsity ratios, achieving up to 2.21%, 11.43%, 7.04%, and 4.81% higher scores on the commonsense reasoning, reading comprehension, MMLU, and BBH benchmarks, respectively.
arXiv Detail & Related papers (2023-10-08T05:16:28Z)
- LoRAPrune: Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning [56.88751562302793]
Low-rank adaptation (LoRA) has emerged as a way to fine-tune large language models (LLMs).
LoRAPrune is a new framework that delivers an accurate structured pruned model in a highly memory-efficient manner.
LoRAPrune reduces perplexity by 4.81 on WikiText2 and 3.46 on PTB, while also decreasing memory usage by 52.6%.
arXiv Detail & Related papers (2023-05-28T15:15:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.