Related papers: Streamlining Redundant Layers to Compress Large Language Models

Related papers

A Sliding Layer Merging Method for Efficient Depth-Wise Pruning in LLMs [14.514670828712669]
This paper reveals the "Patch-like" feature relationship between layers in large language models by analyzing the correlation of the outputs of different layers in the reproducing kernel Hilbert space. We propose a sliding layer merging method that dynamically selects and fuses consecutive layers from top to bottom according to a pre-defined similarity threshold. Our method outperforms existing pruning techniques in both zero-shot inference performance and retraining recovery quality after pruning.
arXiv Detail & Related papers (2025-02-26T14:15:24Z)
LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive. Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones. We propose textbfLESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z)
Reassessing Layer Pruning in LLMs: New Insights and Methods [24.394438652261982]
We show that a simple approach, i.e., pruning the final 25% of layers followed by fine-tuning the textttlm_head and the remaining last three layer, yields remarkably strong performance. We release the optimal model weights on Hface, and the code is available on GitHub.
arXiv Detail & Related papers (2024-11-23T13:31:16Z)
LiNeS: Post-training Layer Scaling Prevents Forgetting and Enhances Model Merging [80.17238673443127]
LiNeS is a post-training editing technique designed to preserve pre-trained generalization while enhancing fine-tuned task performance. LiNeS demonstrates significant improvements in both single-task and multi-task settings across various benchmarks in vision and natural language processing.
arXiv Detail & Related papers (2024-10-22T16:26:05Z)
AlphaPruning: Using Heavy-Tailed Self Regularization Theory for Improved Layer-wise Pruning of Large Language Models [94.82766517752418]
We propose AlphaPruning, which uses shape metrics to allocate layerwise sparsity ratios in a more theoretically principled manner. Our results show that AlphaPruning prunes LLaMA-7B to 80% sparsity while maintaining reasonable perplexity, marking a first in the literature on LLMs.
arXiv Detail & Related papers (2024-10-14T03:35:11Z)
Chip-Tuning: Classify Before Language Models Say [25.546473157624945]
Chip-tuning is a simple and effective structured pruning framework for classification problems. We show that chip-tuning significantly outperforms previous state-of-the-art baselines in both accuracy and pruning ratio. We also find that chip-tuning could be applied on multimodal models, and could be combined with model finetuning, proving its excellent compatibility.
arXiv Detail & Related papers (2024-10-09T04:35:22Z)
Search for Efficient Large Language Models [52.98684997131108]
Large Language Models (LLMs) have long held sway in the realms of artificial intelligence research. Weight pruning, quantization, and distillation have been embraced to compress LLMs, targeting memory reduction and inference acceleration. Most model compression techniques concentrate on weight optimization, overlooking the exploration of optimal architectures.
arXiv Detail & Related papers (2024-09-25T21:32:12Z)
A deeper look at depth pruning of LLMs [49.30061112976263]
Large Language Models (LLMs) are resource-intensive to train but more costly to deploy in production. Recent work has attempted to prune blocks of LLMs based on cheap proxies for estimating block importance. We show that adaptive metrics exhibit a trade-off in performance between tasks.
arXiv Detail & Related papers (2024-07-23T08:40:27Z)
MINI-LLM: Memory-Efficient Structured Pruning for Large Language Models [20.958265043544603]
Large Language Models (LLMs) grow dramatically in size. There is an increasing trend in compressing and speeding up these models. We propose a hybrid pruning criterion, which appropriately integrates magnitude, activation, and gradient.
arXiv Detail & Related papers (2024-07-16T12:59:44Z)
SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios. In the early route, intermediate outputs are consolidated via an anti-redundancy operation. In the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z)
RankAdaptor: Hierarchical Rank Allocation for Efficient Fine-Tuning Pruned LLMs via Performance Model [4.926801686932735]
We introduce RankAdaptor, a hierarchical rank allocation method that enables efficient fine-tuning of pruned models. We show that RankAdaptor consistently outperforms state-of-the-art methods across a variety of pruning settings and LLM architectures.
arXiv Detail & Related papers (2024-06-22T04:52:58Z)
LayerMerge: Neural Network Depth Compression through Layer Pruning and Merging [20.774060844559838]
Existing depth compression methods remove redundant non-linear activation functions and merge the consecutive convolution layers into a single layer. These methods suffer from a critical drawback; the kernel size of the merged layers becomes larger. We show that this problem can be addressed by jointly pruning convolution layers and activation functions. We propose LayerMerge, a novel depth compression method that selects which activation layers and convolution layers to remove.
arXiv Detail & Related papers (2024-06-18T17:55:15Z)
BlockPruner: Fine-grained Pruning for Large Language Models [23.523314522663455]
Research indicates certain layers in large language models (LLMs) harbor substantial redundancy, and pruning these layers has minimal impact on the overall performance. We propose a novel, training-free structured pruning approach called BlockPruner. We show that BlockPruner achieves more granular and effective pruning compared to state-of-the-art baselines.
arXiv Detail & Related papers (2024-06-15T11:03:33Z)
Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning on Large-Language Models. We learn the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model. Our method operates for 2.7 hours with around 35GB memory for the 13B models on a single A100 GPU.
arXiv Detail & Related papers (2024-06-15T09:31:03Z)
The Unreasonable Ineffectiveness of the Deeper Layers [5.984361440126354]
We study a simple layer-pruning strategy for popular families of open-weight pretrained LLMs. We find minimal degradation of performance until after a large fraction of the layers are removed. From a scientific perspective, the robustness of these LLMs to the deletion of layers implies either that current pretraining methods are not properly leveraging the parameters in the deeper layers of the network or that the shallow layers play a critical role in storing knowledge.
arXiv Detail & Related papers (2024-03-26T17:20:04Z)
ShortGPT: Layers in Large Language Models are More Redundant Than You Expect [38.148626520751385]
We show that many layers of Large Language Models (LLMs) exhibit high similarity, and some layers play a negligible role in network functionality. We propose a straightforward pruning approach: layer removal, in which we directly delete the redundant layers. Experiments demonstrate that our method, which we call ShortGPT, significantly outperforms previous state-of-the-art (SOTA) methods in model pruning.
arXiv Detail & Related papers (2024-03-06T17:04:18Z)
BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation [54.28841287750586]
Large language models (LLMs) have demonstrated outstanding performance in various tasks, such as text summarization, text question-answering, and etc. Existing solutions such as SparseGPT and Wanda attempt to alleviate this issue through weight pruning. This paper introduces a novel LLM pruning technique dubbed blockwise parameter-efficient sparsity allocation (BESA) by applying a blockwise reconstruction loss.
arXiv Detail & Related papers (2024-02-18T12:44:15Z)
LaCo: Large Language Model Pruning via Layer Collapse [56.92068213969036]
Large language models (LLMs) based on transformer are witnessing a notable trend of size expansion. Existing methods such as model quantization, knowledge distillation, and model pruning are constrained by various issues. We propose a concise layer-wise structured pruner called textitLayer Collapse (LaCo), in which rear model layers collapse into a prior layer.
arXiv Detail & Related papers (2024-02-17T04:16:30Z)
Layer-adaptive sparsity for the Magnitude-based Pruning [88.37510230946478]
We propose a novel importance score for global pruning, coined layer-adaptive magnitude-based pruning (LAMP) score. LAMP consistently outperforms popular existing schemes for layerwise sparsity selection.
arXiv Detail & Related papers (2020-10-15T09:14:02Z)

This list is automatically generated from the titles and abstracts of the papers in this site.