Fluctuation-based Adaptive Structured Pruning for Large Language Models
- URL: http://arxiv.org/abs/2312.11983v1
- Date: Tue, 19 Dec 2023 09:23:48 GMT
- Title: Fluctuation-based Adaptive Structured Pruning for Large Language Models
- Authors: Yongqi An, Xu Zhao, Tao Yu, Ming Tang, Jinqiao Wang
- Abstract summary: FLAP (FLuctuation-based Adaptive Structured Pruning) is a retraining-free structured pruning framework for Large Language Models. It is hardware-friendly, effectively reducing storage and enhancing inference speed.
- Score: 44.217363567065
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Network pruning is a promising way to address the huge computing resource
demands of deploying and running inference with Large Language Models (LLMs).
Being retraining-free is important for LLM pruning methods. However, almost all
existing retraining-free pruning approaches for LLMs focus on unstructured
pruning, which requires specific hardware support for acceleration. In this
paper, we propose a novel retraining-free structured pruning framework for
LLMs, named FLAP (FLuctuation-based Adaptive Structured Pruning). It is
hardware-friendly, effectively reducing storage and enhancing inference
speed. For effective structured pruning of LLMs, we highlight three critical
elements that demand the utmost attention: formulating structured importance
metrics, adaptively searching for the global compressed model structure, and
implementing compensation mechanisms to mitigate performance loss. First, FLAP
uses a fluctuation-based pruning metric to determine whether the output feature
map is easily recoverable when a column of the weight matrix is removed. It
then standardizes the importance scores to adaptively determine the global
structure of the compressed model. Finally, FLAP adds bias terms to recover the
output feature maps using the baseline values. We thoroughly evaluate our
approach on a variety of language benchmarks. Without any retraining, our
method significantly outperforms state-of-the-art methods, including LLM-Pruner
and the structured-pruning extension of Wanda. The code is released at
https://github.com/CASIA-IVA-Lab/FLAP.
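The three steps described in the abstract (a fluctuation metric, score-based channel selection, and bias compensation from baseline values) can be sketched for a single linear layer as follows. This is a minimal NumPy illustration under stated assumptions, not the released FLAP implementation: the function names, shapes, and the exact form of the metric (input-channel variance weighted by the squared weight-column norm) are assumptions for illustration.

```python
import numpy as np

def fluctuation_scores(X, W):
    """Score each input channel of a linear layer.

    X: (n_samples, in_features) calibration activations.
    W: (out_features, in_features) weight matrix.
    Channels whose activations barely fluctuate across samples are
    nearly constant, so their contribution can be absorbed into a
    bias after pruning; they receive low scores.
    """
    var = X.var(axis=0)               # per-channel fluctuation
    col_norm2 = (W ** 2).sum(axis=0)  # squared norm of each weight column
    return var * col_norm2

def prune_with_bias_compensation(X, W, b, keep_ratio=0.5):
    """Drop low-score input channels and fold their mean contribution
    into the bias, so the expected output is preserved."""
    scores = fluctuation_scores(X, W)
    n = len(scores)
    k = int(keep_ratio * n)
    mask = np.zeros(n, dtype=bool)
    mask[np.argsort(scores)[n - k:]] = True  # keep top-k channels
    mu = X.mean(axis=0)                      # baseline activation values
    b_new = b + W[:, ~mask] @ mu[~mask]      # compensate pruned channels
    return W[:, mask], b_new, mask
```

Because the compensation term equals the pruned channels' mean contribution, the pruned layer matches the original layer exactly in expectation over the calibration data; the residual error is driven only by how much the pruned channels fluctuate.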
Related papers
- Reconstruct the Pruned Model without Any Retraining [23.235907813011174]
We introduce the Linear Interpolation-based Adaptive Reconstruction (LIAR) framework, which is both efficient and effective.
LIAR does not require back-propagation or retraining and is compatible with various pruning criteria and modules.
Our evaluations on benchmarks such as GLUE, SQuAD, WikiText, and common sense reasoning show that LIAR enables a BERT model to maintain 98% accuracy even after removing 50% of its parameters.
arXiv Detail & Related papers (2024-07-18T09:30:44Z)
- Optimization-based Structural Pruning for Large Language Models without Back-Propagation [57.9629676017527]
We propose an optimization-based structural pruning method for Large Language Models (LLMs).
Our method learns the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model.
Our method runs for 2.7 hours with around 35 GB of memory for 13B models on a single A100 GPU, and our pruned models outperform the state of the art in perplexity.
arXiv Detail & Related papers (2024-06-15T09:31:03Z)
- SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models [53.638791265113625]
SPP is a sparsity-preserved, parameter-efficient fine-tuning method for large language models.
Code will be made available at https://github.com/Lucky-Lance/SPP.
arXiv Detail & Related papers (2024-05-25T04:55:27Z)
- LoRAPrune: Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning [56.88751562302793]
Low-rank adaptation (LoRA) has emerged as a way to efficiently fine-tune large language models (LLMs).
LoRAPrune is a new framework that delivers an accurate structured pruned model in a highly memory-efficient manner.
LoRAPrune achieves a reduction in perplexity by 4.81 on WikiText2 and 3.46 on PTB, while also decreasing memory usage by 52.6%.
arXiv Detail & Related papers (2023-05-28T15:15:48Z)
- LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation.
We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset.
Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
arXiv Detail & Related papers (2023-05-19T12:10:53Z)
- Automatic Attention Pruning: Improving and Automating Model Pruning using Attentions [5.445935252764351]
Pruning is a promising approach to compress deep learning models in order to deploy them on resource-constrained edge devices.
This paper presents Automatic Attention Pruning (AAP), an adaptive, attention-based, structured pruning approach to automatically generate small, accurate, and hardware-efficient models.
arXiv Detail & Related papers (2023-03-14T02:47:57Z)
- MLPruning: A Multilevel Structured Pruning Framework for Transformer-based Models [78.45898846056303]
Pruning is an effective method to reduce the memory footprint and computational cost associated with large natural language processing models.
We develop a novel MultiLevel structured Pruning framework, which uses three different levels of structured pruning: head pruning, row pruning, and block-wise sparse pruning.
arXiv Detail & Related papers (2021-05-30T22:00:44Z)
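The three granularities named in the MLPruning entry above can be illustrated with a small magnitude-based sketch. This is a NumPy illustration under stated assumptions, not the framework's code: the shapes, function names, and the norm-based selection rule are assumptions chosen for clarity.

```python
import numpy as np

def head_prune(x, head_mask):
    """Zero whole attention heads. x: (seq, n_heads, head_dim)."""
    return x * np.asarray(head_mask)[None, :, None]

def row_prune(W, keep_rows):
    """Keep the `keep_rows` rows with the largest L2 norm; zero the rest."""
    norms = np.linalg.norm(W, axis=1)
    out = np.zeros_like(W)
    idx = np.argsort(norms)[-keep_rows:]
    out[idx] = W[idx]
    return out

def block_prune(W, block, keep_frac):
    """Zero the fixed-size blocks with the smallest Frobenius norm."""
    r, c = W.shape
    assert r % block == 0 and c % block == 0
    tiles = W.reshape(r // block, block, c // block, block)
    norms = np.sqrt((tiles ** 2).sum(axis=(1, 3)))  # one norm per tile
    k = int(keep_frac * norms.size)
    thresh = np.sort(norms.ravel())[-k] if k > 0 else np.inf
    mask = norms >= thresh                          # keep the largest tiles
    return (tiles * mask[:, None, :, None]).reshape(r, c)
```

Each level trades granularity for hardware friendliness: head and row pruning remove whole dense structures that shrink the matrix shapes, while block-wise pruning retains finer control at the cost of needing block-sparse kernels.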