The Structural Scalpel: Automated Contiguous Layer Pruning for Large Language Models
- URL: http://arxiv.org/abs/2510.23652v1
- Date: Sat, 25 Oct 2025 16:40:17 GMT
- Title: The Structural Scalpel: Automated Contiguous Layer Pruning for Large Language Models
- Authors: Yao Lu, Yuqi Li, Wenbin Xie, Shanqing Yu, Qi Xuan, Zhaowei Zhu, Shiping Wen,
- Abstract summary: We propose CLP, a novel continuous layer pruning framework for large language models.<n>CLP uses differentiable concave gate algorithm that automatically identifies the best continuous layer segments for pruning.<n>CLP can be seamlessly combined with quantization to further compress the model with only a slight performance loss.
- Score: 33.90597962418094
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although large language models (LLMs) have achieved revolutionary breakthroughs in many fields, their large model size and high computational cost pose significant challenges for practical deployment on resource-constrained edge devices. To this end, layer pruning has been proposed to reduce the computational overhead by directly removing redundant layers. However, existing layer pruning methods typically rely on hand-crafted metrics to evaluate and remove individual layers, while ignoring the dependencies between layers. This can disrupt the model's information flow and severely degrade performance. To address these issues, we propose CLP, a novel continuous layer pruning framework that introduces two key innovations: a differentiable concave gate algorithm that automatically identifies the best continuous layer segments for pruning via gradient-based optimization; and a cutoff endpoint tuning strategy that effectively restores model performance by fine-tuning only the layers adjacent to the pruned segments. Extensive experiments across multiple model architectures (including LLaMA2, LLaMA3 and Qwen) and sizes (from $7$B to $70$B parameters) show that CLP significantly outperforms existing state-of-the-art baselines. For example, at a pruning rate of $20\%$, CLP achieves an average performance retention of $95.34\%$ on LLaMA3-70B, outperforming baselines by $4.29\%$-$30.52\%$. Furthermore, CLP can be seamlessly combined with quantization to further compress the model with only a slight performance loss.
Related papers
- Curvature-Weighted Capacity Allocation: A Minimum Description Length Framework for Layer-Adaptive Large Language Model Optimization [8.029535985033485]
Layer-wise capacity in large language models is non-uniform; some layers contribute disproportionately to loss reduction while others are near-redundant.<n>Existing methods for exploiting this non-uniformity, such as influence-function-based layer scoring, produce sensitivity estimates but offer no principled mechanism for translating them into allocation or pruning decisions.<n>We address this gap with a unified, curvature-aware framework grounded in the Minimum Description Length (MDL) principle.
arXiv Detail & Related papers (2026-03-01T04:14:15Z) - Pruning as a Cooperative Game: Surrogate-Assisted Layer Contribution Estimation for Large Language Models [17.818685759025207]
Layer-wise pruning is a commonly employed strategy to mitigate inference costs.<n>This paper proposes a game-theoretic framework that formulates layer pruning as a cooperative game.<n>It achieves more efficient and effective layer-wise pruning for large language models.
arXiv Detail & Related papers (2026-02-08T03:51:36Z) - Understanding and Guiding Layer Placement in Parameter-Efficient Fine-Tuning of Large Language Models [19.448467763421707]
Large language models (LLMs) continue to grow, making parameter-efficient fine-tuning the default strategy for downstream adaptation.<n>Current practice typically applies PEFT uniformly across all layers, with limited understanding or leverage of layer selection.<n>This paper develops a unified projected residual view of PEFT on top of a frozen base model.
arXiv Detail & Related papers (2026-02-03T21:05:55Z) - Pangu Light: Weight Re-Initialization for Pruning and Accelerating LLMs [79.7618807098457]
Large Language Models (LLMs) deliver state-of-the-art capabilities across numerous tasks, but their immense size and inference costs pose significant computational challenges for practical deployment.<n>This paper argues that a critical, often overlooked, aspect in making such aggressive joint pruning viable is the strategic re-initialization and adjustment of remaining weights.<n>We introduce Pangu Light, a framework for LLM acceleration centered around structured pruning and novel weight re-initialization techniques.
arXiv Detail & Related papers (2025-05-26T15:57:08Z) - A Sliding Layer Merging Method for Efficient Depth-Wise Pruning in LLMs [13.000188564679998]
This paper reveals the Patch-like'' feature relationship between layers in large language models by analyzing the correlation of the outputs of different layers in the reproducing kernel Hilbert space.<n>We propose a sliding layer merging method that dynamically selects and fuses consecutive layers from top to bottom according to a pre-defined similarity threshold.<n>Our method outperforms existing pruning techniques in both zero-shot inference performance and retraining recovery quality after pruning.
arXiv Detail & Related papers (2025-02-26T14:15:24Z) - LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive.<n>Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones.<n>We propose textbfLESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z) - Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning that learns the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model.<n>We achieve this by learning an underlying Bernoulli distribution to sample binary pruning masks.<n>Experiments conducted on LLaMA, LLaMA-2, LLaMA-3, Vicuna, and Mistral models demonstrate the promising performance of our method in efficiency and effectiveness.
arXiv Detail & Related papers (2024-06-15T09:31:03Z) - FinerCut: Finer-grained Interpretable Layer Pruning for Large Language Models [54.787308652357794]
FinerCut is a new form of fine-grained layer pruning for transformer networks.
Our approach retains 90% performance of Llama3-8B with 25% layers removed, and 95% performance of Llama3-70B with 30% layers removed, all without fine-tuning or post-pruning reconstruction.
arXiv Detail & Related papers (2024-05-28T14:21:15Z) - LaCo: Large Language Model Pruning via Layer Collapse [56.92068213969036]
Large language models (LLMs) based on transformer are witnessing a notable trend of size expansion.
Existing methods such as model quantization, knowledge distillation, and model pruning are constrained by various issues.
We propose a concise layer-wise structured pruner called textitLayer Collapse (LaCo), in which rear model layers collapse into a prior layer.
arXiv Detail & Related papers (2024-02-17T04:16:30Z) - Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning [52.29522018586365]
We study structured pruning as an effective means to develop smaller LLMs from pre-trained, larger models.
Our approach employs two key techniques: (1) targeted structured pruning, which prunes a larger model to a specified target shape by removing layers, heads, and intermediate and hidden dimensions in an end-to-end manner, and (2) dynamic batch loading, which dynamically updates the composition of sampled data in each training batch based on varying losses across different domains.
arXiv Detail & Related papers (2023-10-10T15:13:30Z) - ECoFLaP: Efficient Coarse-to-Fine Layer-Wise Pruning for Vision-Language
Models [70.45441031021291]
Large Vision-Language Models (LVLMs) can understand the world comprehensively by integrating rich information from different modalities.
LVLMs are often problematic due to their massive computational/energy costs and carbon consumption.
We propose Efficient Coarse-to-Fine LayerWise Pruning (ECoFLaP), a two-stage coarse-to-fine weight pruning approach for LVLMs.
arXiv Detail & Related papers (2023-10-04T17:34:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.