Fluctuation-based Adaptive Structured Pruning for Large Language Models
- URL: http://arxiv.org/abs/2312.11983v1
- Date: Tue, 19 Dec 2023 09:23:48 GMT
- Title: Fluctuation-based Adaptive Structured Pruning for Large Language Models
- Authors: Yongqi An, Xu Zhao, Tao Yu, Ming Tang, Jinqiao Wang
- Abstract summary: FLAP (FLuctuation-based Adaptive Structured Pruning) is a retraining-free structured pruning framework for Large Language Models. It is hardware-friendly, effectively reducing storage and enhancing inference speed.
- Score: 44.217363567065
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Network pruning is a promising way to address the huge computing resource
demands of deploying and running inference with Large Language Models (LLMs).
Being retraining-free is important for LLM pruning methods. However, almost all
existing retraining-free pruning approaches for LLMs focus on unstructured
pruning, which requires specific hardware support for acceleration. In this
paper, we propose a novel retraining-free structured pruning framework for
LLMs, named FLAP (FLuctuation-based Adaptive Structured Pruning). It is
hardware-friendly, effectively reducing storage and enhancing inference
speed. For effective structured pruning of LLMs, we highlight three critical
elements that demand the utmost attention: formulating structured importance
metrics, adaptively searching for the global compressed model structure, and
implementing compensation mechanisms to mitigate performance loss. First, FLAP
uses a fluctuation-based pruning metric to determine whether the output feature
map is easily recoverable when a column of the weight matrix is removed. It
then standardizes the importance scores to adaptively determine the global
structure of the compressed model. Finally, FLAP adds bias terms to recover the
output feature maps using the baseline values. We thoroughly evaluate our
approach on a variety of language benchmarks. Without any retraining, our
method significantly outperforms state-of-the-art methods, including LLM-Pruner
and the structured-pruning extension of Wanda. The code is released at
https://github.com/CASIA-IVA-Lab/FLAP.
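The three steps described in the abstract (a fluctuation metric, score-based channel selection, and bias compensation from baseline values) can be sketched for a single linear layer as follows. This is a minimal NumPy illustration under stated assumptions, not the released FLAP implementation: the function names, shapes, and the exact form of the metric (input-channel variance weighted by the squared weight-column norm) are assumptions for illustration.

```python
import numpy as np

def fluctuation_scores(X, W):
    """Score each input channel of a linear layer.

    X: (n_samples, in_features) calibration activations.
    W: (out_features, in_features) weight matrix.
    Channels whose activations barely fluctuate across samples are
    nearly constant, so their contribution can be absorbed into a
    bias after pruning; they receive low scores.
    """
    var = X.var(axis=0)               # per-channel fluctuation
    col_norm2 = (W ** 2).sum(axis=0)  # squared norm of each weight column
    return var * col_norm2

def prune_with_bias_compensation(X, W, b, keep_ratio=0.5):
    """Drop low-score input channels and fold their mean contribution
    into the bias, so the expected output is preserved."""
    scores = fluctuation_scores(X, W)
    n = len(scores)
    k = int(keep_ratio * n)
    mask = np.zeros(n, dtype=bool)
    mask[np.argsort(scores)[n - k:]] = True  # keep top-k channels
    mu = X.mean(axis=0)                      # baseline activation values
    b_new = b + W[:, ~mask] @ mu[~mask]      # compensate pruned channels
    return W[:, mask], b_new, mask
```

Because the compensation term equals the pruned channels' mean contribution, the pruned layer matches the original layer exactly in expectation over the calibration data; the residual error is driven only by how much the pruned channels fluctuate.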
Related papers
- Reconstruct the Pruned Model without Any Retraining [23.235907813011174]
We introduce the Linear Interpolation-based Adaptive Reconstruction (LIAR) framework, which is both efficient and effective.
LIAR does not require back-propagation or retraining and is compatible with various pruning criteria and modules.
Our evaluations on benchmarks such as GLUE, SQuAD, WikiText, and common sense reasoning show that LIAR enables a BERT model to maintain 98% accuracy even after removing 50% of its parameters.
arXiv Detail & Related papers (2024-07-18T09:30:44Z)
- Optimization-based Structural Pruning for Large Language Models without Back-Propagation [57.9629676017527]
We propose an optimization-based structural pruning method for Large Language Models (LLMs).
Our method learns the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model.
Our method runs for 2.7 hours with around 35 GB of memory for 13B models on a single A100 GPU, and our pruned models outperform the state of the art in perplexity.
arXiv Detail & Related papers (2024-06-15T09:31:03Z)
- SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models [53.638791265113625]
SPP is a sparsity-preserved, parameter-efficient fine-tuning method for large language models.
Code will be made available at https://github.com/Lucky-Lance/SPP.
arXiv Detail & Related papers (2024-05-25T04:55:27Z)
- LoRAPrune: Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning [56.88751562302793]
Low-rank adaptation (LoRA) has emerged as a way to efficiently fine-tune large language models (LLMs).
LoRAPrune is a new framework that delivers an accurate structured pruned model in a highly memory-efficient manner.
LoRAPrune achieves a reduction in perplexity by 4.81 on WikiText2 and 3.46 on PTB, while also decreasing memory usage by 52.6%.
arXiv Detail & Related papers (2023-05-28T15:15:48Z)
- LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation.
We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset.
Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
arXiv Detail & Related papers (2023-05-19T12:10:53Z)
- Automatic Attention Pruning: Improving and Automating Model Pruning using Attentions [5.445935252764351]
Pruning is a promising approach to compress deep learning models in order to deploy them on resource-constrained edge devices.
This paper presents Automatic Attention Pruning (AAP), an adaptive, attention-based, structured pruning approach to automatically generate small, accurate, and hardware-efficient models.
arXiv Detail & Related papers (2023-03-14T02:47:57Z)
- MLPruning: A Multilevel Structured Pruning Framework for Transformer-based Models [78.45898846056303]
Pruning is an effective method to reduce the memory footprint and computational cost associated with large natural language processing models.
We develop a novel MultiLevel structured Pruning framework, which uses three different levels of structured pruning: head pruning, row pruning, and block-wise sparse pruning.
arXiv Detail & Related papers (2021-05-30T22:00:44Z)
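The three granularities named in the MLPruning entry above can be illustrated with a small magnitude-based sketch. This is a NumPy illustration under stated assumptions, not the framework's code: the shapes, function names, and the norm-based selection rule are assumptions chosen for clarity.

```python
import numpy as np

def head_prune(x, head_mask):
    """Zero whole attention heads. x: (seq, n_heads, head_dim)."""
    return x * np.asarray(head_mask)[None, :, None]

def row_prune(W, keep_rows):
    """Keep the `keep_rows` rows with the largest L2 norm; zero the rest."""
    norms = np.linalg.norm(W, axis=1)
    out = np.zeros_like(W)
    idx = np.argsort(norms)[-keep_rows:]
    out[idx] = W[idx]
    return out

def block_prune(W, block, keep_frac):
    """Zero the fixed-size blocks with the smallest Frobenius norm."""
    r, c = W.shape
    assert r % block == 0 and c % block == 0
    tiles = W.reshape(r // block, block, c // block, block)
    norms = np.sqrt((tiles ** 2).sum(axis=(1, 3)))  # one norm per tile
    k = int(keep_frac * norms.size)
    thresh = np.sort(norms.ravel())[-k] if k > 0 else np.inf
    mask = norms >= thresh                          # keep the largest tiles
    return (tiles * mask[:, None, :, None]).reshape(r, c)
```

Each level trades granularity for hardware friendliness: head and row pruning remove whole dense structures that shrink the matrix shapes, while block-wise pruning retains finer control at the cost of needing block-sparse kernels.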