BlockPruner: Fine-grained Pruning for Large Language Models
- URL: http://arxiv.org/abs/2406.10594v3
- Date: Mon, 26 Aug 2024 14:30:38 GMT
- Title: BlockPruner: Fine-grained Pruning for Large Language Models
- Authors: Longguang Zhong, Fanqi Wan, Ruijun Chen, Xiaojun Quan, Liangzhi Li
- Abstract summary: Research indicates certain layers in large language models (LLMs) harbor substantial redundancy, and pruning these layers has minimal impact on the overall performance.
We propose a novel, training-free structured pruning approach called BlockPruner.
We show that BlockPruner achieves more granular and effective pruning compared to state-of-the-art baselines.
- Score: 23.523314522663455
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the rapid growth in the size and complexity of large language models (LLMs), the costs associated with their training and inference have escalated significantly. Research indicates that certain layers in LLMs harbor substantial redundancy, and pruning these layers has minimal impact on the overall performance. While various layer pruning methods have been developed based on this insight, they generally overlook the finer-grained redundancies within the layers themselves. In this paper, we delve deeper into the architecture of LLMs and demonstrate that finer-grained pruning can be achieved by targeting redundancies in multi-head attention (MHA) and multi-layer perceptron (MLP) blocks. We propose a novel, training-free structured pruning approach called BlockPruner. Unlike existing layer pruning methods, BlockPruner segments each Transformer layer into MHA and MLP blocks. It then assesses the importance of these blocks using perplexity measures and applies a heuristic search for iterative pruning. We applied BlockPruner to LLMs of various sizes and architectures and validated its performance across a wide range of downstream tasks. Experimental results show that BlockPruner achieves more granular and effective pruning compared to state-of-the-art baselines.
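The procedure described in the abstract (split each layer into MHA and MLP blocks, score blocks by perplexity, prune iteratively) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a greedy search that removes, at each step, the block whose removal yields the lowest perplexity, and it replaces a real model with a toy perplexity function. The names `split_into_blocks`, `iterative_block_pruning`, `toy_perplexity`, and `TOY_IMPORTANCE` are hypothetical and exist only for this example.

```python
import math
from typing import Callable, FrozenSet, List, Tuple

# A block is identified by (layer index, block type), with type "mha" or "mlp".
Block = Tuple[int, str]


def split_into_blocks(num_layers: int) -> List[Block]:
    """Split each Transformer layer into one MHA block and one MLP block."""
    return [(i, kind) for i in range(num_layers) for kind in ("mha", "mlp")]


def iterative_block_pruning(
    blocks: List[Block],
    perplexity: Callable[[FrozenSet[Block]], float],
    num_to_remove: int,
) -> List[Block]:
    """Greedily remove blocks, re-scoring the remaining candidates each round.

    `perplexity` receives the set of removed blocks and returns the perplexity
    of the model with those blocks skipped (assumed to be measured on a small
    calibration set in a real setting).
    """
    removed: List[Block] = []
    for _ in range(min(num_to_remove, len(blocks))):
        candidates = [b for b in blocks if b not in removed]
        # Pick the block whose additional removal hurts perplexity the least.
        best = min(candidates, key=lambda b: perplexity(frozenset(removed + [b])))
        removed.append(best)
    return removed


# Hypothetical per-block importance values, invented for this toy example.
TOY_IMPORTANCE = {
    (0, "mha"): 0.90, (0, "mlp"): 0.50,
    (1, "mha"): 0.05, (1, "mlp"): 0.70,
    (2, "mha"): 0.30, (2, "mlp"): 0.01,
}


def toy_perplexity(removed: FrozenSet[Block]) -> float:
    """Stand-in for a real evaluation: loss grows with the removed importance."""
    return math.exp(1.0 + sum(TOY_IMPORTANCE[b] for b in removed))


blocks = split_into_blocks(3)
pruned = iterative_block_pruning(blocks, toy_perplexity, num_to_remove=2)
print(pruned)  # [(2, 'mlp'), (1, 'mha')]
```

Because candidates are re-scored after every removal, the search can pick an MHA block from one layer and an MLP block from another, which whole-layer pruning cannot express. With a real model, `perplexity` would run a forward pass with the listed blocks bypassed.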
Related papers
- DISP-LLM: Dimension-Independent Structural Pruning for Large Language Models [62.98273649512654]
Large Language Models (LLMs) have achieved remarkable success in various natural language processing tasks.
Increased memory and computational costs associated with these models pose significant challenges for deployment on resource-limited devices.
We propose a novel approach that relaxes the constraint imposed by regular structural pruning methods.
arXiv Detail & Related papers (2024-10-15T18:51:18Z)
- AlphaPruning: Using Heavy-Tailed Self Regularization Theory for Improved Layer-wise Pruning of Large Language Models [94.82766517752418]
We propose AlphaPruning, which uses shape metrics to allocate layerwise sparsity ratios in a more theoretically principled manner.
Our results show that AlphaPruning prunes LLaMA-7B to 80% sparsity while maintaining reasonable perplexity, marking a first in the literature on LLMs.
arXiv Detail & Related papers (2024-10-14T03:35:11Z)
- A Generic Layer Pruning Method for Signal Modulation Recognition Deep Learning Models [17.996775444294276]
Deep neural networks are becoming the preferred method for signal classification.
They often come with high computational complexity and large model sizes.
We propose a novel layer pruning method to address this challenge.
arXiv Detail & Related papers (2024-06-12T06:46:37Z)
- Streamlining Redundant Layers to Compress Large Language Models [21.27944103424621]
This paper introduces LLM-Streamline, a pioneering work on layer pruning for large language models (LLMs).
LLM-Streamline comprises two parts: layer pruning, which removes consecutive layers with the lowest importance based on target sparsity, and layer replacement, a novel module that trains a lightweight network to replace the pruned layers to mitigate performance loss.
Experiments show that LLM-Streamline outperforms both previous and concurrent state-of-the-art pruning methods in terms of both performance and training efficiency.
arXiv Detail & Related papers (2024-03-28T04:12:13Z)
- ShortGPT: Layers in Large Language Models are More Redundant Than You Expect [38.148626520751385]
We show that many layers of Large Language Models (LLMs) exhibit high similarity, and some layers play a negligible role in network functionality.
We propose a straightforward pruning approach: layer removal, in which we directly delete the redundant layers.
Experiments demonstrate that our method, which we call ShortGPT, significantly outperforms previous state-of-the-art (SOTA) methods in model pruning.
arXiv Detail & Related papers (2024-03-06T17:04:18Z)
- BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation [54.28841287750586]
Large language models (LLMs) have demonstrated outstanding performance in various tasks, such as text summarization and text question-answering.
Existing solutions such as SparseGPT and Wanda attempt to alleviate this issue through weight pruning.
This paper introduces a novel LLM pruning technique dubbed blockwise parameter-efficient sparsity allocation (BESA) by applying a blockwise reconstruction loss.
arXiv Detail & Related papers (2024-02-18T12:44:15Z)
- LaCo: Large Language Model Pruning via Layer Collapse [56.92068213969036]
Transformer-based large language models (LLMs) are witnessing a notable trend of size expansion.
Existing methods such as model quantization, knowledge distillation, and model pruning are constrained by various issues.
We propose a concise layer-wise structured pruner called Layer Collapse (LaCo), in which rear model layers collapse into a prior layer.
arXiv Detail & Related papers (2024-02-17T04:16:30Z)
- Shortened LLaMA: Depth Pruning for Large Language Models with Comparison of Retraining Methods [5.135352292810664]
We show that simple depth pruning can effectively compress large language models (LLMs).
Our pruning method boosts inference speeds, especially under memory-constrained conditions.
We hope this work can help build compact yet capable LLMs.
arXiv Detail & Related papers (2024-02-05T09:44:49Z)
- ECoFLaP: Efficient Coarse-to-Fine Layer-Wise Pruning for Vision-Language Models [70.45441031021291]
Large Vision-Language Models (LVLMs) can understand the world comprehensively by integrating rich information from different modalities.
LVLMs are often problematic due to their massive computational/energy costs and carbon consumption.
We propose Efficient Coarse-to-Fine LayerWise Pruning (ECoFLaP), a two-stage coarse-to-fine weight pruning approach for LVLMs.
arXiv Detail & Related papers (2023-10-04T17:34:00Z)
- Efficient Reinforcement Learning in Block MDPs: A Model-free Representation Learning Approach [73.62265030773652]
We present BRIEE, an algorithm for efficient reinforcement learning in Markov Decision Processes with block-structured dynamics.
BRIEE interleaves latent states discovery, exploration, and exploitation together, and can provably learn a near-optimal policy.
We show that BRIEE is more sample efficient than the state-of-the-art Block MDP algorithm HOMER RL and other empirical baselines on challenging rich-observation combination lock problems.
arXiv Detail & Related papers (2022-01-31T19:47:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.