Greedy Output Approximation: Towards Efficient Structured Pruning for LLMs Without Retraining
- URL: http://arxiv.org/abs/2407.19126v1
- Date: Fri, 26 Jul 2024 23:53:59 GMT
- Title: Greedy Output Approximation: Towards Efficient Structured Pruning for LLMs Without Retraining
- Authors: Jianwei Li, Yijun Dong, Qi Lei,
- Abstract summary: We simplify the pruning process for Transformer-based large language models (LLMs)
We propose two inference-aware pruning criteria derived from the optimization perspective of output approximation.
We also introduce a two-step reconstruction technique to mitigate pruning errors without model retraining.
- Score: 16.026565606764954
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: To remove redundant components of large language models (LLMs) without incurring significant computational costs, this work focuses on single-shot pruning without a retraining phase. We simplify the pruning process for Transformer-based LLMs by identifying a depth-2 pruning structure that functions independently. Additionally, we propose two inference-aware pruning criteria derived from the optimization perspective of output approximation, which outperforms traditional training-aware metrics such as gradient and Hessian. We also introduce a two-step reconstruction technique to mitigate pruning errors without model retraining. Experimental results demonstrate that our approach significantly reduces computational costs and hardware requirements while maintaining superior performance across various datasets and models.
Related papers
- Efficient Diffusion as Low Light Enhancer [63.789138528062225]
Reflectance-Aware Trajectory Refinement (RATR) is a simple yet effective module to refine the teacher trajectory using the reflectance component of images.
textbfReflectance-aware textbfDiffusion with textbfDistilled textbfTrajectory (textbfReDDiT) is an efficient and flexible distillation framework tailored for Low-Light Image Enhancement (LLIE)
arXiv Detail & Related papers (2024-10-16T08:07:18Z) - DISP-LLM: Dimension-Independent Structural Pruning for Large Language Models [62.98273649512654]
Large Language Models (LLMs) have achieved remarkable success in various natural language processing tasks.
Increased memory and computational costs associated with these models pose significant challenges for deployment on resource-limited devices.
We propose a novel approach that relaxes the constraint imposed by regular structural pruning methods.
arXiv Detail & Related papers (2024-10-15T18:51:18Z) - A Convex-optimization-based Layer-wise Post-training Pruner for Large Language Models [24.185245582500876]
We introduce FISTAPruner, the first post-training pruner based on convex optimization models and algorithms.
FISTAPruner incorporates an intra-layer cumulative error correction mechanism and supports parallel pruning.
We evaluate FISTAPruner on models such as OPT, LLaMA, LLaMA-2, and LLaMA-3 with 125M to 70B parameters under unstructured and 2:4 semi-structured sparsity.
arXiv Detail & Related papers (2024-08-07T12:33:46Z) - Reconstruct the Pruned Model without Any Retraining [23.235907813011174]
We introduce the Linear Interpolation-based Adaptive Reconstruction (LIAR) framework, which is both efficient and effective.
LIAR does not require back-propagation or retraining and is compatible with various pruning criteria and modules.
Our evaluations on benchmarks such as GLUE, SQuAD, WikiText, and common sense reasoning show that LIAR enables a BERT model to maintain 98% accuracy even after removing 50% of its parameters.
arXiv Detail & Related papers (2024-07-18T09:30:44Z) - Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning on Large-Language Models.
We learn the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model.
Our method operates for 2.7 hours with around 35GB memory for the 13B models on a single A100 GPU.
arXiv Detail & Related papers (2024-06-15T09:31:03Z) - ALPS: Improved Optimization for Highly Sparse One-Shot Pruning for Large Language Models [14.310720048047136]
ALPS is an optimization-based framework that tackles the pruning problem using the operator splitting technique and a preconditioned gradient conjugate-based post-processing step.
Our approach incorporates novel techniques to accelerate and theoretically guarantee convergence while leveraging vectorization and GPU parallelism for efficiency.
On the OPT-30B model with 70% sparsity, ALPS achieves a 13% reduction in test perplexity on the WikiText dataset and a 19% improvement in zero-shot benchmark performance compared to existing methods.
arXiv Detail & Related papers (2024-06-12T02:57:41Z) - SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models [53.638791265113625]
Sparsity-Preserved efficient fine-tuning method for large language models.
Code will be made available at https://github.com/Lucky-Lance/SPP.
arXiv Detail & Related papers (2024-05-25T04:55:27Z) - LD-Pruner: Efficient Pruning of Latent Diffusion Models using Task-Agnostic Insights [2.8461446020965435]
We introduce LD-Pruner, a novel performance-preserving structured pruning method for compressing Latent Diffusion Models.
We demonstrate the effectiveness of our approach on three different tasks: text-to-image (T2I) generation, Unconditional Image Generation (UIG) and Unconditional Audio Generation (UAG)
arXiv Detail & Related papers (2024-04-18T06:35:37Z) - Advancing Model Pruning via Bi-level Optimization [89.88761425199598]
iterative magnitude pruning (IMP) is the predominant pruning method to successfully find 'winning tickets'
One-shot pruning methods have been developed, but these schemes are usually unable to find winning tickets as good as IMP.
We show that the proposed bi-level optimization-oriented pruning method (termed BiP) is a special class of BLO problems with a bi-linear problem structure.
arXiv Detail & Related papers (2022-10-08T19:19:29Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.