Related papers: Determining Layer-wise Sparsity for Large Language Models Through a Theoretical Perspective

URL: http://arxiv.org/abs/2502.14770v1
Date: Thu, 20 Feb 2025 17:51:10 GMT
Title: Determining Layer-wise Sparsity for Large Language Models Through a Theoretical Perspective
Authors: Weizhong Huang, Yuxin Zhang, Xiawu Zheng, Fei Chao, Rongrong Ji
Abstract summary: We address the challenge of determining the layer-wise sparsity rates of large language models (LLMs) through a theoretical perspective. This refers to the cumulative effect of reconstruction errors throughout the sparsification process. We derive a simple yet effective approach to layer-wise sparsity allocation that mitigates this issue.
Score: 55.90119819642064
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this paper, we address the challenge of determining the layer-wise sparsity rates of large language models (LLMs) through a theoretical perspective. Specifically, we identify a critical issue of "reconstruction error explosion" in existing LLM sparsification methods. This refers to the cumulative effect of reconstruction errors throughout the sparsification process, where errors from earlier layers propagate and amplify in subsequent layers. As a result, the overall reconstruction error increases significantly, leading to a substantial degradation in model performance. Through theoretical analysis, we derive a simple yet effective approach to layer-wise sparsity allocation that mitigates this issue. Our method uses a monotonically increasing arithmetic progression, reducing the process of determining sparsity rates for multiple layers to the determination of a single common difference hyperparameter. Remarkably, this allows for the optimal layer-wise sparsity rates to be identified with just a few trials. Both our theoretical analysis and experimental results demonstrate that this sparsity allocation scheme is near optimal. Extensive experiments show that our method significantly improves the performance of sparse LLMs across various architectures, outperforming existing layer-wise sparsity methods. Furthermore, it enhances the performance of various compression techniques and is applicable to vision and multimodal models. Notably, our method achieves a reduction of 52.10 in perplexity for the 70% sparse LLaMA2-7B model obtained via Wanda, improves average zero-shot accuracy by 10.50%, and delivers speedups of 2.63x and 2.23x on CPU and GPU, respectively.
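
The arithmetic-progression allocation described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the version below makes an assumption: the progression is centered so that the mean over layers equals the target sparsity. The function name `arithmetic_sparsity` and its parameters are hypothetical, not taken from the paper.

```python
from typing import List


def arithmetic_sparsity(num_layers: int, avg_sparsity: float,
                        common_diff: float) -> List[float]:
    """Per-layer sparsity rates forming an arithmetic progression.

    Assumption (not from the paper): the progression is centered so that
    the mean rate over all layers equals avg_sparsity.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    if common_diff < 0:
        raise ValueError("common_diff must be non-negative (increasing rates)")

    # Index of the (possibly fractional) middle layer; offsets around it sum to 0.
    center = (num_layers - 1) / 2
    rates = [avg_sparsity + common_diff * (i - center) for i in range(num_layers)]

    # Reject settings that leave [0, 1] instead of clipping, since clipping
    # would silently change the average sparsity.
    if rates[0] < 0.0 or rates[-1] > 1.0:
        raise ValueError("common_diff too large for this avg_sparsity")
    return rates


if __name__ == "__main__":
    # Hypothetical setting: 32 layers, 70% average sparsity.
    for rate in arithmetic_sparsity(32, 0.7, 0.005):
        print(f"{rate:.4f}")
```

Under this assumption, the search over layer-wise rates reduces to trying a few values of `common_diff` and keeping the one that gives the best validation perplexity; `common_diff = 0` recovers uniform sparsity.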

Related papers

$\nabla$-Reasoner: LLM Reasoning via Test-Time Gradient Descent in Latent Space [71.23672814629448]
$\nabla$-Reasoner is an iterative generation framework that integrates differentiable optimization over token logits into the decoding loop. $\nabla$-Reasoner achieves over 20% accuracy improvement on a challenging mathematical reasoning benchmark.
arXiv Detail & Related papers (2026-03-05T08:42:54Z)
Unbiased Gradient Low-Rank Projection [32.57081286181632]
A popular strategy involves gradient low-rank projection, storing only the projected states, with GaLore being a representative example. This paper investigates the layerwise sampling technique for debiasing low-rank projection mechanisms. An instantiation of the paradigm gives rise to a novel and unbiased low-rank optimization method built upon GaLore's mechanism and the Muon algorithm.
arXiv Detail & Related papers (2025-10-20T17:59:25Z)
Gluon: Making Muon & Scion Great Again! (Bridging Theory and Practice of LMO-based Optimizers for LLMs) [45.81187493164445]
Recent developments in deep learning optimization have brought about radically new algorithms. These algorithms are based on the Linear Minimization Oracle (LMO) framework. We propose a new LMO-based method called Gluon, capturing prior theoretically analyzed methods as special cases.
arXiv Detail & Related papers (2025-05-19T17:50:45Z)
Efficient Jailbreaking of Large Models by Freeze Training: Lower Layers Exhibit Greater Sensitivity to Harmful Content [11.626522946410596]
This study conducts sampling and normalization of the parameters of the Large Language Models to generate visual representations and heatmaps of parameter distributions. Based on this finding, we employ a Freeze training strategy, selectively performing Supervised Fine-Tuning only on the lower layers. Experimental results demonstrate that this method significantly reduces training duration and GPU memory consumption while maintaining a high jailbreak success rate and a high harm score.
arXiv Detail & Related papers (2025-02-28T11:07:41Z)
A Sliding Layer Merging Method for Efficient Depth-Wise Pruning in LLMs [14.514670828712669]
This paper reveals the "Patch-like" feature relationship between layers in large language models by analyzing the correlation of the outputs of different layers in the reproducing kernel Hilbert space. We propose a sliding layer merging method that dynamically selects and fuses consecutive layers from top to bottom according to a pre-defined similarity threshold. Our method outperforms existing pruning techniques in both zero-shot inference performance and retraining recovery quality after pruning.
arXiv Detail & Related papers (2025-02-26T14:15:24Z)
LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive. Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones. We propose LESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z)
Leveraging the true depth of LLMs [46.81174316936993]
Large Language Models (LLMs) demonstrate remarkable capabilities at the cost of high compute requirements. Recent studies have demonstrated that intermediate layers in LLMs can be removed or reordered without substantial accuracy loss. We propose a novel method that groups consecutive layers into pairs evaluated in parallel.
arXiv Detail & Related papers (2025-02-05T00:26:27Z)
Efficient Diffusion as Low Light Enhancer [63.789138528062225]
Reflectance-Aware Trajectory Refinement (RATR) is a simple yet effective module to refine the teacher trajectory using the reflectance component of images. Reflectance-aware Diffusion with Distilled Trajectory (ReDDiT) is an efficient and flexible distillation framework tailored for Low-Light Image Enhancement (LLIE).
arXiv Detail & Related papers (2024-10-16T08:07:18Z)
Q-VLM: Post-training Quantization for Large Vision-Language Models [73.19871905102545]
We propose a post-training quantization framework for large vision-language models (LVLMs) for efficient multi-modal inference. We mine the cross-layer dependency that significantly influences discretization errors of the entire vision-language model, and embed this dependency into the optimal quantization strategy. Experimental results demonstrate that our method compresses memory by 2.78x and increases generation speed by 1.44x on the 13B LLaVA model without performance degradation.
arXiv Detail & Related papers (2024-10-10T17:02:48Z)
On Effects of Steering Latent Representation for Large Language Model Unlearning [4.058064008234271]
Representation Misdirection for Unlearning (RMU) is an effective method for large language model (LLM) unlearning. We show that steering forget representations in the intermediate layer reduces token confidence, causing LLMs to generate wrong or nonsense responses. We propose Adaptive RMU, a simple yet effective alternative method that makes unlearning effective with most layers.
arXiv Detail & Related papers (2024-08-12T15:24:50Z)
The Unreasonable Ineffectiveness of the Deeper Layers [5.984361440126354]
We study a simple layer-pruning strategy for popular families of open-weight pretrained LLMs. We find minimal degradation of performance until after a large fraction of the layers are removed. From a scientific perspective, the robustness of these LLMs to the deletion of layers implies either that current pretraining methods are not properly leveraging the parameters in the deeper layers of the network or that the shallow layers play a critical role in storing knowledge.
arXiv Detail & Related papers (2024-03-26T17:20:04Z)
ShortGPT: Layers in Large Language Models are More Redundant Than You Expect [38.148626520751385]
We show that many layers of Large Language Models (LLMs) exhibit high similarity, and some layers play a negligible role in network functionality. We propose a straightforward pruning approach: layer removal, in which we directly delete the redundant layers. Experiments demonstrate that our method, which we call ShortGPT, significantly outperforms previous state-of-the-art (SOTA) methods in model pruning.
arXiv Detail & Related papers (2024-03-06T17:04:18Z)
Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity [88.62935593360162]
Large Language Models (LLMs) are renowned for their remarkable performance across diverse domains. We introduce a novel LLM pruning methodology that incorporates a tailored set of non-uniform layerwise sparsity ratios, termed Outlier Weighed Layerwise sparsity (OWL). OWL exhibits a remarkable performance gain, surpassing the state-of-the-art Wanda and SparseGPT by 61.22 and 6.80 perplexity at a high sparsity level of 70%, respectively.
arXiv Detail & Related papers (2023-10-08T14:22:58Z)
ECoFLaP: Efficient Coarse-to-Fine Layer-Wise Pruning for Vision-Language Models [70.45441031021291]
Large Vision-Language Models (LVLMs) can understand the world comprehensively by integrating rich information from different modalities. LVLMs are often problematic due to their massive computational/energy costs and carbon consumption. We propose Efficient Coarse-to-Fine LayerWise Pruning (ECoFLaP), a two-stage coarse-to-fine weight pruning approach for LVLMs.
arXiv Detail & Related papers (2023-10-04T17:34:00Z)
Intermediate Layer Optimization for Inverse Problems using Deep Generative Models [86.29330440222199]
ILO is a novel optimization algorithm for solving inverse problems with deep generative models. We empirically show that our approach outperforms state-of-the-art methods introduced in StyleGAN-2 and PULSE for a wide range of inverse problems.
arXiv Detail & Related papers (2021-02-15T06:52:22Z)

This list is automatically generated from the titles and abstracts of the papers on this site.