Fluctuation-based Adaptive Structured Pruning for Large Language Models
- URL: http://arxiv.org/abs/2312.11983v1
- Date: Tue, 19 Dec 2023 09:23:48 GMT
- Title: Fluctuation-based Adaptive Structured Pruning for Large Language Models
- Authors: Yongqi An, Xu Zhao, Tao Yu, Ming Tang, Jinqiao Wang
- Abstract summary: FLAP (FLuctuation-based Adaptive Structured Pruning) is a retraining-free structured pruning framework for Large Language Models.
It is hardware-friendly, effectively reducing storage and enhancing inference speed.
- Score: 44.217363567065
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Network Pruning is a promising way to address the huge computing resource
demands of the deployment and inference of Large Language Models (LLMs).
Being retraining-free is important for LLM pruning methods. However, almost all of
the existing retraining-free pruning approaches for LLMs focus on unstructured
pruning, which requires specific hardware support for acceleration. In this
paper, we propose a novel retraining-free structured pruning framework for
LLMs, named FLAP (FLuctuation-based Adaptive Structured Pruning). It is
hardware-friendly, effectively reducing storage and enhancing inference
speed. For effective structured pruning of LLMs, we highlight three critical
elements that demand the utmost attention: formulating structured importance
metrics, adaptively searching the global compressed model, and implementing
compensation mechanisms to mitigate performance loss. First, FLAP determines
whether the output feature map is easily recoverable when a column of the weight matrix is
removed, based on the fluctuation pruning metric. Then it standardizes the
importance scores to adaptively determine the global compressed model
structure. Finally, FLAP adds additional bias terms to recover the output
feature maps using the baseline values. We thoroughly evaluate our approach on
a variety of language benchmarks. Without any retraining, our method
significantly outperforms the state-of-the-art methods, including LLM-Pruner
and the extension of Wanda in structured pruning. The code is released at
https://github.com/CASIA-IVA-Lab/FLAP.
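As a rough, non-authoritative sketch of the pipeline described above, the PyTorch-style code below scores each weight column by the fluctuation (variance) of its input channel over calibration data, prunes the lowest-scoring columns, and folds their baseline (mean) contribution into an added bias term. The function names, the exact form of the metric, and the omitted cross-layer standardization step are assumptions for illustration; the released repository above is the authoritative implementation.

```python
import torch

def fluctuation_scores(W: torch.Tensor, X: torch.Tensor) -> torch.Tensor:
    """W: (out_features, in_features) weight; X: (n_tokens, in_features) calibration inputs.
    A column whose input channel barely fluctuates is easy to remove and compensate for."""
    fluctuation = X.var(dim=0, unbiased=False)   # per-input-channel variance over calibration tokens
    col_norm_sq = W.pow(2).sum(dim=0)            # squared L2 norm of each weight column
    return fluctuation * col_norm_sq             # higher score = harder to recover if removed

def prune_with_bias_compensation(W, b, X, keep_ratio: float = 0.7):
    """Keep the highest-scoring columns and absorb the removed ones into an updated bias."""
    scores = fluctuation_scores(W, X)
    n_keep = int(keep_ratio * W.shape[1])
    keep = torch.topk(scores, n_keep).indices.sort().values
    drop_mask = torch.ones(W.shape[1], dtype=torch.bool, device=W.device)
    drop_mask[keep] = False
    baseline = X.mean(dim=0)                     # baseline (mean) activation of each input channel
    b_new = b + W[:, drop_mask] @ baseline[drop_mask]   # compensate the output feature map
    return W[:, keep], b_new, keep
```

In the paper, the per-module scores are further standardized so that the compression ratio can be allocated adaptively across the whole model; that global search step is omitted here for brevity.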
Related papers
- Scaling Law for Post-training after Model Pruning [24.9935656519956]
Large language models (LLMs) based on the Transformer architecture are widely employed across various domains and tasks.
To mitigate their substantial resource demands, model pruning techniques have been developed to create more efficient models while maintaining high performance.
This paper investigates the post-training requirements of pruned LLMs and introduces a scaling law to determine the optimal amount of post-training data.
arXiv Detail & Related papers (2024-11-15T15:28:42Z)
- Pruning Foundation Models for High Accuracy without Retraining [48.256389781305415]
It is challenging to deploy foundation models or large language models (LLMs) due to their massive parameters and computations.
Post-training pruning methods have been proposed to prune LLMs in one shot without retraining.
Our experiments demonstrate the superior performance of the proposed methods in comparison to SOTA baselines.
arXiv Detail & Related papers (2024-10-21T01:23:34Z)
- Search for Efficient Large Language Models [52.98684997131108]
Large Language Models (LLMs) have long held sway in the realms of artificial intelligence research.
Weight pruning, quantization, and distillation have been embraced to compress LLMs, targeting memory reduction and inference acceleration.
Most model compression techniques concentrate on weight optimization, overlooking the exploration of optimal architectures.
arXiv Detail & Related papers (2024-09-25T21:32:12Z)
- CFSP: An Efficient Structured Pruning Framework for LLMs with Coarse-to-Fine Activation Information [33.01180010689081]
We introduce an efficient structured pruning framework named CFSP.
We first allocate the sparsity budget across blocks based on their importance and then retain important weights within each block.
Results demonstrate that CFSP outperforms existing methods on diverse models across various sparsity budgets.
arXiv Detail & Related papers (2024-09-20T04:03:27Z)
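The CFSP entry above only names the coarse-to-fine idea, so the following is a heavily hedged sketch of one plausible reading: a global sparsity budget is first split across blocks in inverse proportion to an externally supplied importance score (coarse step), and each block then keeps its largest-magnitude weights (fine step). The allocation rule, the source of the importance scores, and the magnitude criterion are assumptions, not CFSP's activation-based formulation.

```python
import torch

def allocate_sparsity(block_importance: torch.Tensor, global_sparsity: float) -> torch.Tensor:
    """Coarse step: less important blocks absorb a larger share of the pruning budget."""
    inv = 1.0 / (block_importance + 1e-8)
    share = inv / inv.sum()
    # Mean per-block sparsity equals the global target (assuming equal block sizes).
    per_block = global_sparsity * share * block_importance.numel()
    return per_block.clamp(0.0, 0.95)

def prune_block(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Fine step: zero out the smallest-magnitude weights within one block."""
    k = int(sparsity * weight.numel())
    if k == 0:
        return weight
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight * (weight.abs() > threshold)
```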
- A Convex-optimization-based Layer-wise Post-training Pruner for Large Language Models [24.185245582500876]
We introduce FISTAPruner, the first post-training pruner based on convex optimization models and algorithms.
FISTAPruner incorporates an intra-layer cumulative error correction mechanism and supports parallel pruning.
We evaluate FISTAPruner on models such as OPT, LLaMA, LLaMA-2, and LLaMA-3 with 125M to 70B parameters under unstructured and 2:4 semi-structured sparsity.
arXiv Detail & Related papers (2024-08-07T12:33:46Z)
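FISTA itself is the standard accelerated proximal-gradient method, so purely as an illustration of layer-wise post-training pruning via convex optimization, the sketch below sparsifies one linear layer by minimizing an L1-regularized output-reconstruction objective on calibration inputs. The objective, the hyperparameters, and the omission of FISTAPruner's cumulative error correction and 2:4 constraint handling are assumptions, not the paper's formulation.

```python
import torch

def soft_threshold(z: torch.Tensor, tau) -> torch.Tensor:
    """Proximal operator of tau * ||.||_1."""
    return torch.sign(z) * torch.clamp(z.abs() - tau, min=0.0)

def fista_layer_pruner(W: torch.Tensor, X: torch.Tensor, lam: float = 1e-2,
                       n_iter: int = 100) -> torch.Tensor:
    """Sparsify a linear layer's weight W (out, in) while keeping its outputs on the
    calibration inputs X (n, in) close to the dense layer's outputs:
        minimize 0.5 * ||X W_hat^T - X W^T||_F^2 + lam * ||W_hat||_1
    solved with FISTA (proximal gradient plus Nesterov momentum)."""
    target = X @ W.T
    lip = torch.linalg.eigvalsh(X.T @ X)[-1].clamp(min=1e-8)  # Lipschitz constant of the smooth part
    W_prev, Y, t = W.clone(), W.clone(), 1.0
    for _ in range(n_iter):
        grad = (X @ Y.T - target).T @ X
        W_new = soft_threshold(Y - grad / lip, lam / lip)
        t_new = (1 + (1 + 4 * t * t) ** 0.5) / 2
        Y = W_new + ((t - 1) / t_new) * (W_new - W_prev)
        W_prev, t = W_new, t_new
    return W_prev
```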
- Reconstruct the Pruned Model without Any Retraining [23.235907813011174]
We introduce the Linear Interpolation-based Adaptive Reconstruction (LIAR) framework, which is both efficient and effective.
LIAR does not require back-propagation or retraining and is compatible with various pruning criteria and modules.
Our evaluations on benchmarks such as GLUE, SQuAD, WikiText, and common sense reasoning show that LIAR enables a BERT model to maintain 98% accuracy even after removing 50% of its parameters.
arXiv Detail & Related papers (2024-07-18T09:30:44Z)
- Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning method for Large Language Models.
We learn the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model.
Our method runs in about 2.7 hours using around 35 GB of memory for 13B models on a single A100 GPU.
arXiv Detail & Related papers (2024-06-15T09:31:03Z)
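Since the entry above learns pruning masks in a probabilistic space from forward passes only, here is a minimal REINFORCE-style sketch under assumptions: masks are Bernoulli variables with per-unit logits, `eval_loss(mask)` runs the masked model forward and returns its loss, and no sparsity constraint is enforced. This is a generic policy-gradient estimator, not the paper's actual formulation.

```python
import torch

def policy_gradient_mask_step(logits: torch.Tensor, eval_loss, lr: float = 1e-2,
                              n_samples: int = 4) -> torch.Tensor:
    """One REINFORCE update on per-unit mask logits; no back-propagation through the model."""
    probs = torch.sigmoid(logits)
    samples, losses = [], []
    for _ in range(n_samples):
        mask = torch.bernoulli(probs)            # sample which units to keep (1) or prune (0)
        samples.append(mask)
        losses.append(float(eval_loss(mask)))    # forward pass of the masked model only
    baseline = sum(losses) / n_samples           # simple variance-reduction baseline
    grad = torch.zeros_like(logits)
    for mask, loss in zip(samples, losses):
        grad += (loss - baseline) * (mask - probs)   # d log p(mask | logits) / d logits
    grad /= n_samples
    return logits - lr * grad                    # descend on the expected pruned-model loss
```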
- SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models [53.638791265113625]
SPP is a sparsity-preserved, parameter-efficient fine-tuning method for large language models.
Code will be made available at https://github.com/Lucky-Lance/SPP.
arXiv Detail & Related papers (2024-05-25T04:55:27Z)
- MLPruning: A Multilevel Structured Pruning Framework for Transformer-based Models [78.45898846056303]
Pruning is an effective method to reduce the memory footprint and computational cost associated with large natural language processing models.
We develop a novel MultiLevel structured Pruning framework, which uses three different levels of structured pruning: head pruning, row pruning, and block-wise sparse pruning.
arXiv Detail & Related papers (2021-05-30T22:00:44Z)
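As a loose illustration of what the three levels in the MLPruning entry operate on (a sketch under assumptions, not the authors' implementation): head pruning masks whole attention heads, row pruning drops low-norm rows of a feed-forward weight matrix, and block-wise sparse pruning zeroes low-norm tiles.

```python
import torch

def prune_heads(attn_out: torch.Tensor, head_mask: torch.Tensor) -> torch.Tensor:
    """Head pruning: attn_out is (batch, n_heads, seq, head_dim); head_mask is (n_heads,) of 0/1."""
    return attn_out * head_mask.view(1, -1, 1, 1)

def prune_rows(W: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Row pruning: zero the lowest-L2-norm rows of a feed-forward weight matrix."""
    norms = W.norm(dim=1)
    keep = torch.topk(norms, max(1, int(keep_ratio * W.shape[0]))).indices
    mask = torch.zeros(W.shape[0], dtype=W.dtype, device=W.device)
    mask[keep] = 1.0
    return W * mask.unsqueeze(1)

def prune_blocks(W: torch.Tensor, block: int = 32, keep_ratio: float = 0.5) -> torch.Tensor:
    """Block-wise sparse pruning: zero the lowest-norm (block x block) tiles of W."""
    out_b, in_b = W.shape[0] // block, W.shape[1] // block
    tiles = W[:out_b * block, :in_b * block].reshape(out_b, block, in_b, block)
    tile_norms = tiles.pow(2).sum(dim=(1, 3)).sqrt()               # one norm per tile
    k = max(1, int(keep_ratio * tile_norms.numel()))
    thresh = torch.topk(tile_norms.flatten(), k).values.min()
    mask = (tile_norms >= thresh).to(W.dtype)                      # keep only the strongest tiles
    W_pruned = W.clone()
    W_pruned[:out_b * block, :in_b * block] = (
        tiles * mask[:, None, :, None]
    ).reshape(out_b * block, in_b * block)
    return W_pruned
```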
This list is automatically generated from the titles and abstracts of the papers on this site.