Compresso: Structured Pruning with Collaborative Prompting Learns
Compact Large Language Models
- URL: http://arxiv.org/abs/2310.05015v2
- Date: Wed, 11 Oct 2023 01:46:35 GMT
- Title: Compresso: Structured Pruning with Collaborative Prompting Learns
Compact Large Language Models
- Authors: Song Guo, Jiahang Xu, Li Lyna Zhang, Mao Yang
- Abstract summary: We introduce a new paradigm for structurally pruning Large Language Models, called Compresso.
Our approach, through the collaboration of the proposed resource-efficient pruning algorithm and the LLM itself, learns optimal pruning decisions during the training process.
In experiments, Compresso significantly outperforms one-shot pruning baselines across various sparsity ratios, achieving up to 2.21%, 11.43%, 7.04%, and 4.81% higher scores on the commonsense reasoning, reading comprehension, MMLU, and BBH benchmarks, respectively.
- Score: 15.471290825100075
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the remarkable success of Large Language Models (LLMs), the massive
size poses significant deployment challenges, particularly on
resource-constrained hardware. While existing LLM compression methods focus on
quantization, pruning remains relatively unexplored due to the high cost of
training-based approaches and data collection challenges. One-shot pruning
methods, being cost-effective and data-free, have become dominant in LLM
pruning, but they lead to performance decline under the structured pruning
setting.
In this work, we introduce a new paradigm for structurally pruning LLMs, called
Compresso. Our approach, through the collaboration of the proposed
resource-efficient pruning algorithm and the LLM itself, learns optimal pruning
decisions during the training process. Compresso addresses the challenges of
expensive training costs and data collection by incorporating Low-Rank
Adaptation (LoRA) into the $L_0$ regularization during the instruction tuning
process. Then, we further augment the pruning algorithm by introducing a
collaborative prompt that fosters collaboration between the LLM and the pruning
algorithm, significantly boosting the overall performance. As a result,
Compresso prunes LLaMA-7B to 5.4B, maintaining original performance and even
surpassing LLaMA-7B in reading comprehension by 2.62%. Extensive experiments
demonstrate that Compresso significantly outperforms one-shot pruning baselines
across various sparsity ratios, achieving up to 2.21%, 11.43%, 7.04%, and 4.81%
higher scores on the commonsense reasoning, reading comprehension, MMLU, and
BBH benchmarks, respectively.
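The abstract describes two trainable ingredients: a differentiable $L_0$ sparsity objective that learns which structures to drop, and LoRA so that this can be optimized cheaply during instruction tuning. Below is a minimal PyTorch sketch of the standard building blocks such an approach could combine, namely a hard-concrete gate (the usual relaxation behind $L_0$ regularization) and a low-rank adapter; class names like `HardConcreteGate` and `LoRALinear` are illustrative and not taken from the Compresso code.

```python
import math
import torch
import torch.nn as nn

class HardConcreteGate(nn.Module):
    """Differentiable L0 gate (hard-concrete relaxation, Louizos et al., 2018).

    One gate per prunable structure (e.g. attention head or FFN channel);
    a gate that collapses to 0 marks its structure for removal.
    """
    def __init__(self, num_units, beta=2.0 / 3.0, gamma=-0.1, zeta=1.1):
        super().__init__()
        self.log_alpha = nn.Parameter(torch.zeros(num_units))
        self.beta, self.gamma, self.zeta = beta, gamma, zeta

    def forward(self):
        if self.training:
            # Stochastic gates during training keep the objective differentiable.
            u = torch.rand_like(self.log_alpha).clamp(1e-6, 1 - 1e-6)
            s = torch.sigmoid((u.log() - (1 - u).log() + self.log_alpha) / self.beta)
        else:
            s = torch.sigmoid(self.log_alpha / self.beta)
        return (s * (self.zeta - self.gamma) + self.gamma).clamp(0.0, 1.0)

    def expected_num_kept(self):
        # Expected number of gates that stay non-zero; penalizing its distance
        # to a target count steers training toward the desired sparsity ratio.
        return torch.sigmoid(
            self.log_alpha - self.beta * math.log(-self.gamma / self.zeta)
        ).sum()

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank (LoRA) update."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)            # only LoRA and gate parameters train
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scale
```

In such a setup, the instruction-tuning loss would be combined with a sparsity penalty on `expected_num_kept()`, the gate values would scale the outputs of the corresponding heads or channels, and structures whose gates reach zero would be removed after training. The collaborative prompt described in the abstract is not modeled in this sketch.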
Related papers
- LLM2: Let Large Language Models Harness System 2 Reasoning [65.89293674479907]
Large language models (LLMs) have exhibited impressive capabilities across a myriad of tasks, yet they occasionally yield undesirable outputs.
We introduce LLM2, a novel framework that combines an LLM with a process-based verifier.
The LLM is responsible for generating plausible candidates, while the verifier provides timely process-based feedback to distinguish desirable from undesirable outputs.
arXiv Detail & Related papers (2024-12-29T06:32:36Z)
- Adaptive Pruning for Large Language Models with Structural Importance Awareness [66.2690963378878]
Large language models (LLMs) have significantly improved language understanding and generation capabilities.
LLMs are difficult to deploy on resource-constrained edge devices due to their high computational and storage resource demands.
We propose structurally-aware adaptive pruning (SAAP) to significantly reduce the computational and memory costs while maintaining model performance.
arXiv Detail & Related papers (2024-12-19T18:08:04Z)
- MoE-I$^2$: Compressing Mixture of Experts Models through Inter-Expert Pruning and Intra-Expert Low-Rank Decomposition [32.97035551579975]
We introduce a two-stage compression method tailored for MoE to reduce the model size and decrease the computational cost.
Experiments on Qwen1.5-MoE-A2.7B, DeepSeek-V2-Lite, and Mixtral-8$\times$7B demonstrate that our proposed methods can both reduce the model size and enhance inference efficiency.
arXiv Detail & Related papers (2024-11-01T20:37:58Z)
- FactorLLM: Factorizing Knowledge via Mixture of Experts for Large Language Models [50.331708897857574]
We introduce FactorLLM, a novel approach that decomposes well-trained dense FFNs into sparse sub-networks without requiring any further modifications.
FactorLLM achieves comparable performance to the source model, securing up to 85% of model performance while obtaining over a 30% increase in inference speed.
arXiv Detail & Related papers (2024-08-15T16:45:16Z)
- MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [51.5039731721706]
MindStar is a purely inference-based searching method for large language models.
It formulates reasoning tasks as searching problems and proposes two search ideas to identify the optimal reasoning paths.
It significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1.
arXiv Detail & Related papers (2024-05-25T15:07:33Z)
- Toward Adaptive Large Language Models Structured Pruning via Hybrid-grained Weight Importance Assessment [58.030196381554745]
We introduce the Hybrid-grained Weight Importance Assessment (HyWIA), a novel method that merges fine-grained and coarse-grained evaluations of weight importance for the pruning of large language models (LLMs).
Extensive experiments on LLaMA-V1/V2, Vicuna, Baichuan, and Bloom across various benchmarks demonstrate the effectiveness of HyWIA in pruning LLMs.
arXiv Detail & Related papers (2024-03-16T04:12:50Z)
- Enabling Weak LLMs to Judge Response Reliability via Meta Ranking [38.63721941742435]
We propose a novel cross-query-comparison-based method called $\textit{Meta Ranking}$ (MR).
MR assesses reliability by pairwise ranking of the target query-response pair against multiple reference query-response pairs.
We show that MR can enhance strong LLMs' performance in two practical applications: model cascading and instruction tuning.
arXiv Detail & Related papers (2024-02-19T13:57:55Z)
- Accelerating LLaMA Inference by Enabling Intermediate Layer Decoding via
Instruction Tuning with LITE [62.13435256279566]
Large Language Models (LLMs) have achieved remarkable performance across a wide variety of natural language tasks.
However, their large size makes their inference slow and computationally expensive.
We show that instruction tuning with LITE enables intermediate layers to acquire 'good' generation ability without affecting the generation ability of the final layer.
arXiv Detail & Related papers (2023-10-28T04:07:58Z)
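The last entry above concerns decoding from intermediate layers after instruction tuning with LITE. As a rough illustration of the general idea (confidence-based early exit, not LITE's specific procedure), the PyTorch sketch below stops at the first transformer block whose next-token prediction is sufficiently confident; the shared output head and the 0.9 threshold are assumptions made for the example.

```python
import torch

@torch.no_grad()
def early_exit_next_token(blocks, final_norm, lm_head, hidden, threshold=0.9):
    """Generic confidence-based early exit for next-token prediction.

    blocks:     ordered transformer blocks, each mapping hidden -> hidden
    final_norm: normalization applied before the output head
    lm_head:    shared projection from hidden size to vocabulary size
    hidden:     (batch, seq_len, hidden_size) activations for the prompt
    """
    token, depth = None, 0
    for depth, block in enumerate(blocks, start=1):
        hidden = block(hidden)
        logits = lm_head(final_norm(hidden[:, -1]))   # logits at the last position
        probs = torch.softmax(logits, dim=-1)
        confidence, token = probs.max(dim=-1)
        if bool((confidence >= threshold).all()):     # confident for every sequence
            break                                     # skip the remaining blocks
    return token, depth
```

Per that entry, instruction tuning with LITE is what equips the intermediate layers with this generation ability without affecting the final layer.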
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.