LaCo: Large Language Model Pruning via Layer Collapse
- URL: http://arxiv.org/abs/2402.11187v1
- Date: Sat, 17 Feb 2024 04:16:30 GMT
- Title: LaCo: Large Language Model Pruning via Layer Collapse
- Authors: Yifei Yang, Zouying Cao, Hai Zhao
- Abstract summary: Large language models (LLMs) based on transformer are witnessing a notable trend of size expansion.
We propose a concise layer-wise pruning method called textitLayer Collapse (LaCo), in which rear model layers collapse into a prior layer.
Experiments show that our method maintains an average task performance of over 80% at pruning ratios of 25-30%.
- Score: 63.973142426228016
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based large language models (LLMs) are witnessing a notable
trend of size expansion, which brings considerable costs to both model training
and inference. However, existing methods such as model quantization, knowledge
distillation, and model pruning are constrained by various issues, including
hardware support limitations, the need for extensive training, and alterations
to the internal structure of the model. In this paper, we propose a concise
layer-wise pruning method called Layer Collapse (LaCo), in which rear
model layers collapse into a prior layer, enabling a rapid reduction in model
size while preserving the model structure. Comprehensive experiments show that
our method maintains an average task performance of over 80% at pruning ratios
of 25-30%, significantly outperforming existing state-of-the-art structured
pruning methods. We also conduct post-training experiments to confirm that the
proposed pruning method effectively inherits the parameters of the original
model. Finally, we discuss our motivation from the perspective of layer-wise
similarity and evaluate the performance of the pruned LLMs across various
pruning ratios.
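The abstract describes the collapse only at a high level. Below is a minimal sketch, assuming the merge folds the parameter differences of the rear layers into an anchor layer and then drops them; the merge rule, the choice of layers, and the toy blocks are illustrative, not the paper's exact procedure.

```python
import torch
from torch import nn

def collapse_layers(layers: nn.ModuleList, start: int, num_merged: int) -> nn.ModuleList:
    """Fold layers start+1 .. start+num_merged into layer `start`, then drop them.

    Illustrative merge rule: add each rear layer's parameter difference
    (relative to the anchor layer) onto the anchor layer's parameters.
    Assumes all layers share identical parameter shapes, as transformer blocks do.
    """
    anchor = layers[start]
    with torch.no_grad():
        deltas = [torch.zeros_like(p) for p in anchor.parameters()]
        for offset in range(1, num_merged + 1):
            rear = layers[start + offset]
            for d, p_anchor, p_rear in zip(deltas, anchor.parameters(), rear.parameters()):
                d += p_rear - p_anchor
        for p_anchor, d in zip(anchor.parameters(), deltas):
            p_anchor.add_(d)
    # Keep every layer except the ones that were just merged in.
    kept = [layer for i, layer in enumerate(layers) if not (start < i <= start + num_merged)]
    return nn.ModuleList(kept)

# Toy usage: a stack of 12 identical blocks; collapse blocks 5-7 into block 4 (25% fewer layers).
blocks = nn.ModuleList([nn.Linear(16, 16) for _ in range(12)])
pruned = collapse_layers(blocks, start=4, num_merged=3)
print(len(pruned))  # 9
```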
Related papers
- SMILE: Zero-Shot Sparse Mixture of Low-Rank Experts Construction From Pre-Trained Foundation Models [85.67096251281191]
We present an innovative approach to model fusion called zero-shot Sparse MIxture of Low-rank Experts (SMILE) construction.
SMILE allows for the upscaling of source models into an MoE model without extra data or further training.
We conduct extensive experiments across diverse scenarios, such as image classification and text generation tasks, using full fine-tuning and LoRA fine-tuning.
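The summary does not spell out how the low-rank experts are built. A minimal sketch of one plausible zero-shot construction is shown below, assuming each expert is a truncated SVD of the difference between a fine-tuned weight and the shared base weight; the function name, rank, and shapes are illustrative assumptions.

```python
import torch

def low_rank_expert(w_base: torch.Tensor, w_finetuned: torch.Tensor, rank: int = 8):
    """Extract a rank-`rank` expert from the task-specific weight delta.

    Returns factors (A, B) with A @ B approximating (w_finetuned - w_base),
    so the expert can be stored and applied cheaply on top of the base weight.
    """
    delta = w_finetuned - w_base
    u, s, vh = torch.linalg.svd(delta, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # (out_features, rank)
    b = vh[:rank, :]             # (rank, in_features)
    return a, b

# Toy usage: the rank-8 factors capture most of a low-rank fine-tuning delta.
w_base = torch.randn(256, 256)
w_ft = w_base + 0.01 * (torch.randn(256, 8) @ torch.randn(8, 256))
a, b = low_rank_expert(w_base, w_ft, rank=8)
rel_err = torch.norm(w_ft - w_base - a @ b) / torch.norm(w_ft - w_base)
print(f"relative reconstruction error: {rel_err:.3f}")
```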
arXiv Detail & Related papers (2024-08-19T17:32:15Z)
- Pruning Large Language Models with Semi-Structural Adaptive Sparse Training [17.381160429641316]
We propose a pruning pipeline for semi-structured sparse models via retraining, termed Adaptive Sparse Trainer (AST).
AST transforms dense models into sparse ones by applying decay to masked weights while allowing the model to adaptively select masks throughout the training process.
Our work demonstrates the feasibility of deploying semi-structured sparse large language models and introduces a novel method for achieving highly compressed models.
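As a rough illustration of adaptive mask selection with decay (rather than hard zeroing), the sketch below builds a 2:4 semi-structured mask by magnitude and shrinks the weights outside the mask each step; the exact mask criterion and decay schedule in AST may differ.

```python
import torch

def semi_structured_mask(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Keep the n largest-magnitude weights in every group of m (N:M sparsity)."""
    out_features, in_features = weight.shape
    groups = weight.abs().reshape(out_features, in_features // m, m)
    idx = groups.topk(n, dim=-1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(-1, idx, torch.ones_like(idx, dtype=torch.bool))
    return mask.reshape(out_features, in_features)

def decay_masked_weights(weight: torch.Tensor, mask: torch.Tensor, decay: float = 1e-3):
    """Shrink weights outside the current mask instead of hard-zeroing them,
    so later steps can still revive them if they regain importance."""
    with torch.no_grad():
        weight[~mask] *= (1.0 - decay)

# Inside a training loop (illustrative): recompute the mask, then decay the rest.
w = torch.nn.Parameter(torch.randn(128, 128))
mask = semi_structured_mask(w.detach())
decay_masked_weights(w.data, mask)
```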
arXiv Detail & Related papers (2024-07-30T06:33:44Z)
- Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models [63.36637269634553]
We present a novel method of further improving performance by requiring models to compare multiple reasoning chains.
We find that instruction tuning on DCoT datasets boosts the performance of even smaller, and therefore more accessible, language models.
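A hypothetical sketch of what a single DCoT-style training example might look like, with several divergent chains followed by a comparison step; the exact data format is an assumption, not taken from the paper.

```python
# Build one instruction-tuning example whose target contains multiple
# divergent reasoning chains and an explicit comparison/self-correction step.
def build_dcot_example(question: str, chains: list[str], final_answer: str) -> dict:
    target = ""
    for i, chain in enumerate(chains, start=1):
        target += f"Reasoning chain {i}: {chain}\n"
    target += f"Comparing the chains above, the most consistent answer is: {final_answer}"
    return {"instruction": question, "output": target}

example = build_dcot_example(
    "What is 17 * 24?",
    ["17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.",
     "17 * 24 = 24 * 10 + 24 * 7 = 240 + 168 = 408."],
    "408",
)
print(example["output"])
```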
arXiv Detail & Related papers (2024-07-03T15:01:18Z)
- The LLM Surgeon [33.90611088414982]
We explore data-driven compression of existing pretrained models as an alternative to training smaller models from scratch.
We provide a general framework for unstructured, semi-structured and structured pruning and improve upon weight updates to capture more correlations between weights.
Our method can prune rows and columns from a range of OPT models and Llamav2-7B by 20%-30%, with a negligible loss in performance.
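The summary mentions improved weight updates that capture correlations between weights. The sketch below illustrates the general idea for structured (column) pruning with a simple saliency score and a least-squares refit of the remaining weights; both the score and the update are illustrative stand-ins for the paper's curvature-based method.

```python
import torch

def prune_columns_with_update(weight: torch.Tensor, inputs: torch.Tensor, n_drop: int):
    """Drop the `n_drop` least-salient input columns of `weight`, then update the
    remaining columns so the layer output on `inputs` changes as little as possible.

    weight: (out_features, in_features); inputs: (in_features, n_samples).
    """
    col_saliency = weight.norm(dim=0) * inputs.norm(dim=1)
    keep = col_saliency.argsort(descending=True)[: weight.shape[1] - n_drop].sort().values
    x_keep = inputs[keep]                                 # (kept, n_samples)
    gram = x_keep @ x_keep.T                              # (kept, kept)
    gram += 1e-4 * torch.eye(gram.shape[0])               # damping for stability
    # Least-squares refit: W_new = (W X) X_keep^T (X_keep X_keep^T)^{-1}
    w_new = (weight @ inputs) @ x_keep.T @ torch.linalg.inv(gram)
    return w_new, keep

w = torch.randn(64, 128)
x = torch.randn(128, 512)
w_pruned, kept_cols = prune_columns_with_update(w, x, n_drop=32)
print(w_pruned.shape)  # torch.Size([64, 96])
```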
arXiv Detail & Related papers (2023-12-28T18:59:09Z)
- Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning [52.29522018586365]
We study structured pruning as an effective means to develop smaller LLMs from pre-trained, larger models.
Our approach employs two key techniques: (1) targeted structured pruning, which prunes a larger model to a specified target shape by removing layers, heads, and intermediate and hidden dimensions in an end-to-end manner, and (2) dynamic batch loading, which dynamically updates the composition of sampled data in each training batch based on varying losses across different domains.
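A minimal sketch of the second technique, dynamic batch loading, assuming domain sampling weights are increased for domains whose current loss lags behind a per-domain reference loss; the exponential re-weighting and the reference losses are illustrative, not the paper's exact rule.

```python
import math

def update_domain_weights(current_loss: dict[str, float],
                          reference_loss: dict[str, float],
                          temperature: float = 1.0) -> dict[str, float]:
    """Re-weight training domains toward those lagging behind their reference loss."""
    excess = {d: max(current_loss[d] - reference_loss[d], 0.0) for d in current_loss}
    scores = {d: math.exp(excess[d] / temperature) for d in excess}
    total = sum(scores.values())
    return {d: s / total for d, s in scores.items()}

weights = update_domain_weights(
    current_loss={"web": 2.9, "code": 1.6, "books": 3.2},
    reference_loss={"web": 2.7, "code": 1.7, "books": 2.8},
)
print(weights)  # "books" and "web" get sampled more in the next batch
```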
arXiv Detail & Related papers (2023-10-10T15:13:30Z)
- The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
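A small illustration of comparing models at equivalent compute measured in accelerator hours: the fixed hour budget is translated into a per-model token budget via measured throughput (all numbers below are made up).

```python
def token_budget(accelerator_hours: float, tokens_per_second: float) -> int:
    """Translate an accelerator-hour budget into a training-token budget,
    given a model's measured throughput on the reference accelerator."""
    return int(accelerator_hours * 3600 * tokens_per_second)

# Two models compared at the same 6 accelerator-hour budget: the faster model
# simply gets to train on more tokens.
print(token_budget(6, tokens_per_second=25_000))   # 540000000
print(token_budget(6, tokens_per_second=12_500))   # 270000000
```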
arXiv Detail & Related papers (2023-09-20T10:31:17Z)
- Scaling Pre-trained Language Models to Deeper via Parameter-efficient Architecture [68.13678918660872]
We design a more capable parameter-sharing architecture based on the matrix product operator (MPO).
MPO decomposition can reorganize and factorize the information of a parameter matrix into two parts.
Our architecture shares the central tensor across all layers for reducing the model size.
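A minimal two-core tensor-train (MPO-style) factorization of a weight matrix is sketched below; the paper's architecture uses a full MPO with more local tensors and shares the large central tensor across layers, so this is only an illustration of the reshape-and-factorize idea.

```python
import torch

def mpo_two_cores(weight: torch.Tensor, m1: int, m2: int, n1: int, n2: int, rank: int):
    """Factor a (m1*m2, n1*n2) matrix into two local tensors via reshape + SVD."""
    assert weight.shape == (m1 * m2, n1 * n2)
    t = weight.reshape(m1, m2, n1, n2).permute(0, 2, 1, 3).reshape(m1 * n1, m2 * n2)
    u, s, vh = torch.linalg.svd(t, full_matrices=False)
    core_a = (u[:, :rank] * s[:rank]).reshape(m1, n1, rank)
    core_b = vh[:rank].reshape(rank, m2, n2)
    return core_a, core_b

def mpo_reconstruct(core_a, core_b, m1, m2, n1, n2):
    p = torch.einsum("ijr,rkl->ijkl", core_a, core_b)      # (m1, n1, m2, n2)
    return p.permute(0, 2, 1, 3).reshape(m1 * m2, n1 * n2)

w = torch.randn(64, 64)
a, b = mpo_two_cores(w, 8, 8, 8, 8, rank=32)
print(torch.norm(w - mpo_reconstruct(a, b, 8, 8, 8, 8)) / torch.norm(w))
```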
arXiv Detail & Related papers (2023-03-27T02:34:09Z)
- Gradient-based Intra-attention Pruning on Pre-trained Language Models [21.444503777215637]
We propose a structured pruning method, GRAIN (Gradient-based Intra-attention pruning).
GRAIN inspects and prunes intra-attention structures, which greatly expands the structure search space and enables more flexible models.
Experiments on GLUE, SQuAD, and CoNLL 2003 show that GRAIN notably outperforms other methods, especially in the high sparsity regime.
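A rough sketch of gradient-based scoring of intra-attention structures, using a first-order |weight * gradient| proxy aggregated per dimension within each head; GRAIN's actual scoring and search space are richer than this.

```python
import torch

def intra_attention_importance(weight: torch.Tensor, grad: torch.Tensor,
                               num_heads: int) -> torch.Tensor:
    """Score fine-grained structures inside attention with a first-order proxy.

    weight, grad: (hidden, hidden) projection matrix and its gradient.
    Returns importance of shape (num_heads, head_dim), so pruning can remove
    individual low-scoring dimensions instead of whole heads.
    """
    hidden = weight.shape[0]
    head_dim = hidden // num_heads
    saliency = (weight * grad).abs()       # elementwise |w * grad|
    per_row = saliency.sum(dim=1)          # one score per output dimension
    return per_row.reshape(num_heads, head_dim)

w = torch.randn(768, 768, requires_grad=True)
loss = (w @ torch.randn(768, 4)).pow(2).mean()
loss.backward()
scores = intra_attention_importance(w.detach(), w.grad, num_heads=12)
print(scores.shape)  # torch.Size([12, 64])
```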
arXiv Detail & Related papers (2022-12-15T06:52:31Z)
- On the Effect of Dropping Layers of Pre-trained Transformer Models [35.25025837133909]
We explore strategies to drop layers in pre-trained models, and observe the effect of pruning on downstream GLUE tasks.
We were able to prune BERT, RoBERTa and XLNet models up to 40%, while maintaining up to 98% of their original performance.
Our experiments yield interesting observations: (i) the lower layers are most critical for maintaining downstream task performance, (ii) some tasks, such as paraphrase detection and sentence similarity, are more robust to dropping layers, and (iii) models trained with different objective functions exhibit different learning patterns with respect to layer dropping.
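Top-layer dropping itself is straightforward. A minimal sketch with Hugging Face Transformers is shown below; the model name and the number of dropped layers are arbitrary, and the truncated model would normally be fine-tuned on the downstream task afterwards.

```python
from transformers import AutoModel

# Load a 12-layer encoder and keep only the first 8 layers (drop the top 4).
model = AutoModel.from_pretrained("bert-base-uncased")
model.encoder.layer = model.encoder.layer[:8]
model.config.num_hidden_layers = 8
print(sum(p.numel() for p in model.parameters()))  # noticeably fewer parameters
```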
arXiv Detail & Related papers (2020-04-08T07:09:59Z)