Compresso: Structured Pruning with Collaborative Prompting Learns
Compact Large Language Models
- URL: http://arxiv.org/abs/2310.05015v2
- Date: Wed, 11 Oct 2023 01:46:35 GMT
- Title: Compresso: Structured Pruning with Collaborative Prompting Learns
Compact Large Language Models
- Authors: Song Guo, Jiahang Xu, Li Lyna Zhang, Mao Yang
- Abstract summary: We introduce a new paradigm for structurally pruning Large Language Models, called Compresso.
Our approach, through the collaboration of the proposed resource-efficient pruning algorithm and the LLM itself, learns optimal pruning decisions during the training process.
In experiments, Compresso significantly outperforms one-shot pruning baselines across various sparsity ratios, achieving up to 2.21%, 11.43%, 7.04%, and 4.81% higher scores on the commonsense reasoning, reading comprehension, MMLU, and BBH benchmarks, respectively.
- Score: 15.471290825100075
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the remarkable success of Large Language Models (LLMs), the massive
size poses significant deployment challenges, particularly on
resource-constrained hardware. While existing LLM compression methods focus on
quantization, pruning remains relatively unexplored due to the high cost of
training-based approaches and data collection challenges. One-shot pruning
methods, being cost-effective and data-free, have become dominant in LLM
pruning, but they lead to performance decline under the structured pruning
setting.
In this work, we introduce a new paradigm for structurally pruning LLMs, called
Compresso. Our approach, through the collaboration of the proposed
resource-efficient pruning algorithm and the LLM itself, learns optimal pruning
decisions during the training process. Compresso addresses the challenges of
expensive training costs and data collection by incorporating Low-Rank
Adaptation (LoRA) into the $L_0$ regularization during the instruction tuning
process. Then, we further augment the pruning algorithm by introducing a
collaborative prompt that fosters collaboration between the LLM and the pruning
algorithm, significantly boosting the overall performance. As a result,
Compresso prunes LLaMA-7B to 5.4B, maintaining original performance and even
surpassing LLaMA-7B in reading comprehension by 2.62%. Extensive experiments
demonstrate that Compresso significantly outperforms one-shot pruning baselines
across various sparsity ratios, achieving up to 2.21%, 11.43%, 7.04%, and 4.81%
higher scores on the commonsense reasoning, reading comprehension, MMLU, and
BBH benchmarks, respectively.
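The abstract describes two trainable ingredients: a differentiable $L_0$ sparsity objective that learns which structures to drop, and LoRA so that this can be optimized cheaply during instruction tuning. Below is a minimal PyTorch sketch of the standard building blocks such an approach could combine, namely a hard-concrete gate (the usual relaxation behind $L_0$ regularization) and a low-rank adapter; class names like `HardConcreteGate` and `LoRALinear` are illustrative and not taken from the Compresso code.

```python
import math
import torch
import torch.nn as nn

class HardConcreteGate(nn.Module):
    """Differentiable L0 gate (hard-concrete relaxation, Louizos et al., 2018).

    One gate per prunable structure (e.g. attention head or FFN channel);
    a gate that collapses to 0 marks its structure for removal.
    """
    def __init__(self, num_units, beta=2.0 / 3.0, gamma=-0.1, zeta=1.1):
        super().__init__()
        self.log_alpha = nn.Parameter(torch.zeros(num_units))
        self.beta, self.gamma, self.zeta = beta, gamma, zeta

    def forward(self):
        if self.training:
            # Stochastic gates during training keep the objective differentiable.
            u = torch.rand_like(self.log_alpha).clamp(1e-6, 1 - 1e-6)
            s = torch.sigmoid((u.log() - (1 - u).log() + self.log_alpha) / self.beta)
        else:
            s = torch.sigmoid(self.log_alpha / self.beta)
        return (s * (self.zeta - self.gamma) + self.gamma).clamp(0.0, 1.0)

    def expected_num_kept(self):
        # Expected number of gates that stay non-zero; penalizing its distance
        # to a target count steers training toward the desired sparsity ratio.
        return torch.sigmoid(
            self.log_alpha - self.beta * math.log(-self.gamma / self.zeta)
        ).sum()

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank (LoRA) update."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)            # only LoRA and gate parameters train
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scale
```

In such a setup, the instruction-tuning loss would be combined with a sparsity penalty on `expected_num_kept()`, the gate values would scale the outputs of the corresponding heads or channels, and structures whose gates reach zero would be removed after training. The collaborative prompt described in the abstract is not modeled in this sketch.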
Related papers
- LLM2: Let Large Language Models Harness System 2 Reasoning [65.89293674479907]
Large language models (LLMs) have exhibited impressive capabilities across a myriad of tasks, yet they occasionally yield undesirable outputs.
We introduce LLM2, a novel framework that combines an LLM with a process-based verifier.
The LLM is responsible for generating plausible candidates, while the verifier provides timely process-based feedback to distinguish desirable from undesirable outputs.
arXiv Detail & Related papers (2024-12-29T06:32:36Z)
- Adaptive Pruning for Large Language Models with Structural Importance Awareness [66.2690963378878]
Large language models (LLMs) have significantly improved language understanding and generation capabilities.
LLMs are difficult to deploy on resource-constrained edge devices due to their high computational and storage resource demands.
We propose structurally-aware adaptive pruning (SAAP) to significantly reduce the computational and memory costs while maintaining model performance.
arXiv Detail & Related papers (2024-12-19T18:08:04Z)
- MoE-I$^2$: Compressing Mixture of Experts Models through Inter-Expert Pruning and Intra-Expert Low-Rank Decomposition [32.97035551579975]
We introduce a two-stage compression method tailored for MoE to reduce the model size and decrease the computational cost.
Experiments on Qwen1.5-MoE-A2.7B, DeepSeek-V2-Lite, and Mixtral-8$\times$7B demonstrate that our proposed methods can both reduce the model size and enhance inference efficiency.
arXiv Detail & Related papers (2024-11-01T20:37:58Z)
- FactorLLM: Factorizing Knowledge via Mixture of Experts for Large Language Models [50.331708897857574]
We introduce FactorLLM, a novel approach that decomposes well-trained dense FFNs into sparse sub-networks without requiring any further modifications.
FactorLLM achieves comparable performance to the source model, securing up to 85% of model performance while obtaining over a 30% increase in inference speed.
arXiv Detail & Related papers (2024-08-15T16:45:16Z)
- MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [51.5039731721706]
MindStar is a purely inference-based searching method for large language models.
It formulates reasoning tasks as searching problems and proposes two search ideas to identify the optimal reasoning paths.
It significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1.
arXiv Detail & Related papers (2024-05-25T15:07:33Z)
- Toward Adaptive Large Language Models Structured Pruning via Hybrid-grained Weight Importance Assessment [58.030196381554745]
We introduce the Hybrid-grained Weight Importance Assessment (HyWIA), a novel method that merges fine-grained and coarse-grained evaluations of weight importance for the pruning of large language models (LLMs).
Extensive experiments on LLaMA-V1/V2, Vicuna, Baichuan, and Bloom across various benchmarks demonstrate the effectiveness of HyWIA in pruning LLMs.
arXiv Detail & Related papers (2024-03-16T04:12:50Z)
- Enabling Weak LLMs to Judge Response Reliability via Meta Ranking [38.63721941742435]
We propose a novel cross-query-comparison-based method called $\textit{Meta Ranking}$ (MR).
MR assesses reliability by pairwise ranking of the target query-response pair against multiple reference query-response pairs.
We show that MR can enhance strong LLMs' performance in two practical applications: model cascading and instruction tuning.
arXiv Detail & Related papers (2024-02-19T13:57:55Z)
- Accelerating LLaMA Inference by Enabling Intermediate Layer Decoding via
Instruction Tuning with LITE [62.13435256279566]
Large Language Models (LLMs) have achieved remarkable performance across a wide variety of natural language tasks.
However, their large size makes their inference slow and computationally expensive.
We show that instruction tuning with LITE enables intermediate layers to acquire 'good' generation ability without affecting the generation ability of the final layer.
arXiv Detail & Related papers (2023-10-28T04:07:58Z)
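The last entry above concerns decoding from intermediate layers after instruction tuning with LITE. As a rough illustration of the general idea (confidence-based early exit, not LITE's specific procedure), the PyTorch sketch below stops at the first transformer block whose next-token prediction is sufficiently confident; the shared output head and the 0.9 threshold are assumptions made for the example.

```python
import torch

@torch.no_grad()
def early_exit_next_token(blocks, final_norm, lm_head, hidden, threshold=0.9):
    """Generic confidence-based early exit for next-token prediction.

    blocks:     ordered transformer blocks, each mapping hidden -> hidden
    final_norm: normalization applied before the output head
    lm_head:    shared projection from hidden size to vocabulary size
    hidden:     (batch, seq_len, hidden_size) activations for the prompt
    """
    token, depth = None, 0
    for depth, block in enumerate(blocks, start=1):
        hidden = block(hidden)
        logits = lm_head(final_norm(hidden[:, -1]))   # logits at the last position
        probs = torch.softmax(logits, dim=-1)
        confidence, token = probs.max(dim=-1)
        if bool((confidence >= threshold).all()):     # confident for every sequence
            break                                     # skip the remaining blocks
    return token, depth
```

Per that entry, instruction tuning with LITE is what equips the intermediate layers with this generation ability without affecting the final layer.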
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.