Pruning Large Language Models with Semi-Structural Adaptive Sparse Training
- URL: http://arxiv.org/abs/2407.20584v3
- Date: Wed, 18 Dec 2024 07:14:33 GMT
- Title: Pruning Large Language Models with Semi-Structural Adaptive Sparse Training
- Authors: Weiyu Huang, Yuezhou Hu, Guohao Jian, Jun Zhu, Jianfei Chen,
- Abstract summary: Adaptive Sparse Trainer (AST) is a novel and efficient retraining framework tailored for semi-structured sparse models.
AST reduces the perplexity and zero-shot accuracy gap between dense and 2:4 semi-structured sparse models to 0.6 and 1.16%, respectively.
- Score: 17.381160429641316
- License:
- Abstract: The remarkable success of Large Language Models (LLMs) relies heavily on their substantial scale, which poses significant challenges during model deployment in terms of latency and memory consumption. Recently, numerous studies have attempted to compress LLMs using one-shot pruning methods. However, these methods often suffer from considerable performance degradation on complex language understanding tasks, raising concerns about the feasibility of pruning in LLMs. To address this issue, we propose Adaptive Sparse Trainer (AST), a novel and efficient retraining framework tailored for semi-structured sparse models. AST enables models to learn optimal masks during the weight update process without incurring additional computational overhead. Furthermore, we demonstrate that incorporating knowledge distillation significantly improves retraining efficiency and enhances model performance under fixed computational constraints. Additionally, a supplementary set of well-initialized parameters is integrated to further augment the model's efficacy. AST achieves state-of-the-art performance with minimal training cost. When applied to the LLaMA2-7B model, AST reduces the perplexity and zero-shot accuracy gap between dense and 2:4 semi-structured sparse models to 0.6 and 1.16%, respectively, utilizing less than 0.4% of the pretraining tokens and GPU hours. Our work demonstrates the feasibility of deploying semi-structured sparse LLMs and offers a promising alternative for achieving highly compressed models when combined with existing quantization techniques.
Related papers
- LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive.
Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones.
We propose textbfLESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z) - RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models [95.32315448601241]
We propose an algorithm named Rotated Straight-Through-Estimator (RoSTE)
RoSTE combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy to reduce activation outliers.
Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration.
arXiv Detail & Related papers (2025-02-13T06:44:33Z) - Adaptive Pruning for Large Language Models with Structural Importance Awareness [66.2690963378878]
Large language models (LLMs) have significantly improved language understanding and generation capabilities.
LLMs are difficult to deploy on resource-constrained edge devices due to their high computational and storage resource demands.
We propose structurally-aware adaptive pruning (SAAP) to significantly reduce the computational and memory costs while maintaining model performance.
arXiv Detail & Related papers (2024-12-19T18:08:04Z) - TRAWL: Tensor Reduced and Approximated Weights for Large Language Models [11.064868044313855]
We introduce TRAWL (Tensor Reduced and Approximated Weights for Large Language Models), a technique that applies tensor decomposition across multiple weight matrices to effectively denoise LLMs by capturing global structural patterns.
Our experiments show that TRAWL improves model performance by up to 16% over baseline models on benchmark datasets, without requiring additional data, training, or fine-tuning.
arXiv Detail & Related papers (2024-06-25T04:01:32Z) - SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models [53.638791265113625]
Sparsity-Preserved efficient fine-tuning method for large language models.
Code will be made available at https://github.com/Lucky-Lance/SPP.
arXiv Detail & Related papers (2024-05-25T04:55:27Z) - Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks [5.536630285985836]
We introduce parameter-efficient sparsity crafting (PESC)
PESC crafts dense models into sparse models using the mixture-of-experts (MoE) architecture.
Our best sparse model outperforms other sparse and dense models and exhibits superior general capabilities compared to GP3.5.
arXiv Detail & Related papers (2024-01-05T09:58:09Z) - Retrieval-based Knowledge Transfer: An Effective Approach for Extreme
Large Language Model Compression [64.07696663255155]
Large-scale pre-trained language models (LLMs) have demonstrated exceptional performance in various natural language processing (NLP) tasks.
However, the massive size of these models poses huge challenges for their deployment in real-world applications.
We introduce a novel compression paradigm called Retrieval-based Knowledge Transfer (RetriKT) which effectively transfers the knowledge of LLMs to extremely small-scale models.
arXiv Detail & Related papers (2023-10-24T07:58:20Z) - Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning [52.29522018586365]
We study structured pruning as an effective means to develop smaller LLMs from pre-trained, larger models.
Our approach employs two key techniques: (1) targeted structured pruning, which prunes a larger model to a specified target shape by removing layers, heads, and intermediate and hidden dimensions in an end-to-end manner, and (2) dynamic batch loading, which dynamically updates the composition of sampled data in each training batch based on varying losses across different domains.
arXiv Detail & Related papers (2023-10-10T15:13:30Z) - FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only
Quantization for LLMs [9.072821427818557]
Large Language Models (LLMs) have achieved state-of-the-art performance across various language tasks but pose challenges for practical deployment.
We propose an efficient weight-only quantization method that reduces memory consumption and accelerates inference for LLMs.
We evaluate our approach on large-scale open source models such as OPT-175B and internal MoE models, showcasing minimal accuracy loss while achieving up to 3.65 times higher throughput.
arXiv Detail & Related papers (2023-08-16T23:57:41Z) - A Fast and Efficient Conditional Learning for Tunable Trade-Off between
Accuracy and Robustness [11.35810118757863]
Existing models that achieve state-of-the-art (SOTA) performance on both clean and adversarially-perturbed images rely on convolution operations conditioned with feature-wise linear modulation (FiLM) layers.
We present a fast learnable once-for-all adversarial training (FLOAT) algorithm, which instead of the existing FiLM-based conditioning, presents a unique weight conditioned learning that requires no additional layer.
In particular, we add scaled noise to the weight tensors that enables a trade-off between clean and adversarial performance.
arXiv Detail & Related papers (2022-03-28T19:25:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.