GradPruner: Gradient-Guided Layer Pruning Enabling Efficient Fine-Tuning and Inference for LLMs
- URL: http://arxiv.org/abs/2601.19503v1
- Date: Tue, 27 Jan 2026 11:41:26 GMT
- Title: GradPruner: Gradient-Guided Layer Pruning Enabling Efficient Fine-Tuning and Inference for LLMs
- Authors: Wei Huang, Anda Cheng, Yinggui Wang,
- Abstract summary: GradPruner can prune layers of Large Language Models guided by gradients in the early stages of fine-tuning. Results demonstrate that GradPruner has achieved a parameter reduction of 40% with only a 0.99% decrease in accuracy.
- Score: 10.61152477422108
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fine-tuning Large Language Models (LLMs) with downstream data is often considered time-consuming and expensive. Structured pruning methods are primarily employed to improve the inference efficiency of pre-trained models. However, they often require additional time and memory for training, knowledge distillation, structure search, and other strategies, making efficient model fine-tuning challenging to achieve. To simultaneously enhance the training and inference efficiency of downstream task fine-tuning, we introduce GradPruner, which can prune layers of LLMs guided by gradients in the early stages of fine-tuning. GradPruner uses the cumulative gradients of each parameter during the initial phase of fine-tuning to compute the Initial Gradient Information Accumulation Matrix (IGIA-Matrix) to assess the importance of layers and perform pruning. We sparsify the pruned layers based on the IGIA-Matrix and merge them with the remaining layers. Only elements with the same sign are merged to reduce interference from sign variations. We conducted extensive experiments on two LLMs across eight downstream datasets, including medical, financial, and general benchmark tasks. The results demonstrate that GradPruner has achieved a parameter reduction of 40% with only a 0.99% decrease in accuracy. Our code is publicly available.
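The pipeline described in the abstract (accumulate early gradients, score layers, prune, sparsify, merge same-sign elements) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not give the formulas, so every function name here is hypothetical, and the use of absolute gradient sums, the per-layer sum as the importance score, the `keep_fraction` sparsification rule, and merging into the nearest kept layer are all assumptions made for illustration. Layers are represented as flat lists of floats.

```python
# Toy sketch of gradient-guided layer pruning with sign-aware merging.
# ASSUMPTIONS (not from the paper): absolute-gradient accumulation, sum-based
# layer scores, top-fraction sparsification, nearest-kept-layer merge target.

def accumulate_igia(grad_steps):
    """Sum |gradient| per parameter over the early fine-tuning steps.

    grad_steps[s][l][i] is the gradient of parameter i in layer l at step s.
    """
    igia = [[0.0] * len(layer) for layer in grad_steps[0]]
    for step in grad_steps:
        for l, layer_grad in enumerate(step):
            for i, g in enumerate(layer_grad):
                igia[l][i] += abs(g)
    return igia


def layer_scores(igia):
    """Assumed layer importance: total accumulated gradient magnitude."""
    return [sum(row) for row in igia]


def select_pruned_layers(scores, num_prune):
    """Indices of the num_prune lowest-scoring layers, in ascending order."""
    order = sorted(range(len(scores)), key=lambda l: scores[l])
    return sorted(order[:num_prune])


def sparsify(weights, importance, keep_fraction):
    """Zero all but the most important fraction of a pruned layer's weights."""
    k = max(1, int(len(weights) * keep_fraction))
    ranked = sorted(range(len(weights)), key=lambda i: importance[i], reverse=True)
    top = set(ranked[:k])
    return [w if i in top else 0.0 for i, w in enumerate(weights)]


def sign_aware_merge(kept, pruned_sparse):
    """Add a pruned element to the kept one only when both share a sign."""
    return [k + p if k * p > 0 else k for k, p in zip(kept, pruned_sparse)]


def prune_and_merge(weights, grad_steps, num_prune, keep_fraction):
    """Return (remaining layer weights, indices of pruned layers)."""
    igia = accumulate_igia(grad_steps)
    pruned = select_pruned_layers(layer_scores(igia), num_prune)
    kept = [l for l in range(len(weights)) if l not in pruned]
    merged = {l: list(weights[l]) for l in kept}
    for p in pruned:
        # Assumed merge target: the nearest layer that survives pruning.
        target = min(kept, key=lambda l: abs(l - p))
        sparse = sparsify(weights[p], igia[p], keep_fraction)
        merged[target] = sign_aware_merge(merged[target], sparse)
    return [merged[l] for l in kept], pruned


if __name__ == "__main__":
    # Three toy "layers" of four parameters each, two early gradient steps.
    weights = [[0.5, -0.25, 0.125, 1.0],
               [-0.5, 0.25, 0.75, 0.5],
               [0.25, 0.5, -0.75, 1.0]]
    grad_steps = [
        [[0.5, -0.25, 0.0, 0.5], [0.125, 0.0, -0.125, 0.0], [1.0, 0.5, -0.5, 0.25]],
        [[0.25, 0.25, -0.5, 0.5], [0.125, -0.125, 0.25, 0.0], [0.5, 1.0, 1.0, 0.25]],
    ]
    remaining, pruned = prune_and_merge(weights, grad_steps, 1, 0.5)
    print("pruned layers:", pruned)
    print("remaining weights:", remaining)
```

In this toy run the middle layer has the smallest accumulated gradient, so it is the one pruned. Of its two surviving elements, one has the opposite sign to its counterpart in the target layer and is skipped, which is the sign-interference filter the abstract describes.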
Related papers
- GradMAP: Faster Layer Pruning with Gradient Metric and Projection Compensation [23.236542656505417]
GradMAP is a faster layer pruning method with Gradient Metric And Projection compensation.
arXiv Detail & Related papers (2026-02-16T11:14:02Z)
- Layer-wise LoRA fine-tuning: a similarity metric approach [0.6323908398583081]
Low-Rank Adaptation (LoRA) techniques aim to reduce the computational cost of this process by freezing the pre-trained model and updating a smaller number of parameters. We address the previous problem by systematically selecting only a few layers to fine-tune using LoRA or its variants. We reduce the trainable parameters in LoRA-based techniques by up to 50%, while maintaining the predictive performance across different models and tasks.
arXiv Detail & Related papers (2026-02-05T18:38:53Z)
- Compress to Impress: Efficient LLM Adaptation Using a Single Gradient Step on 100 Samples [57.67658635348395]
LASER's exhaustive, per-matrix search makes it impractical for rapid deployment. We show that combining these findings yields a fast and robust adaptation algorithm for downstream tasks.
arXiv Detail & Related papers (2025-10-23T17:58:01Z)
- TreeLoRA: Efficient Continual Learning via Layer-Wise LoRAs Guided by a Hierarchical Gradient-Similarity Tree [52.44403214958304]
In this paper, we introduce TreeLoRA, a novel approach that constructs layer-wise adapters by leveraging hierarchical gradient similarity. To reduce the computational burden of task similarity estimation, we employ bandit techniques to develop an algorithm based on lower confidence bounds. Experiments on both vision transformers (ViTs) and large language models (LLMs) demonstrate the effectiveness and efficiency of our approach.
arXiv Detail & Related papers (2025-06-12T05:25:35Z)
- SDMPrune: Self-Distillation MLP Pruning for Efficient Large Language Models [3.962074007736394]
We introduce a self-distillation loss during the pruning phase (rather than post-training) to fully exploit the predictions of the original model. We demonstrate that our method significantly outperforms existing pruning methods. Our method achieves very competitive performance among 1B-scale open source LLMs.
arXiv Detail & Related papers (2025-06-10T02:24:32Z)
- Two-Stage Regularization-Based Structured Pruning for LLMs [32.65416603453818]
TRSP: Two-Stage Regularization-Based Structured Pruning for Large Language Models. We show that TRSP outperforms strong layer-wise structured pruning methods without requiring retraining. As a layer-wise pruning method, it delivers notable end-to-end acceleration.
arXiv Detail & Related papers (2025-05-23T12:40:59Z)
- LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive. Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones. We propose LESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z)
- ALoRE: Efficient Visual Adaptation via Aggregating Low Rank Experts [71.91042186338163]
ALoRE is a novel PETL method that reuses the hypercomplex parameterized space constructed by Kronecker product to Aggregate Low Rank Experts. Thanks to the artful design, ALoRE maintains negligible extra parameters and can be effortlessly merged into the frozen backbone.
arXiv Detail & Related papers (2024-12-11T12:31:30Z)
- Language Models as Zero-shot Lossless Gradient Compressors: Towards General Neural Parameter Prior Models [56.00251589760559]
Large language models (LLMs) can act as gradient priors in a zero-shot setting. We introduce LM-GC, a novel method that integrates LLMs with arithmetic coding. Experiments indicate that LM-GC surpasses existing state-of-the-art lossless compression methods.
arXiv Detail & Related papers (2024-09-26T13:38:33Z)
- Enhancing Large Language Model Performance with Gradient-Based Parameter Selection [32.88329156118533]
Gradient-Mask Tuning (GMT) is a method that selectively updates parameters during training based on their gradient information. Our empirical results across various tasks demonstrate that GMT not only outperforms traditional fine-tuning methods but also elevates the upper limits of LLM performance.
arXiv Detail & Related papers (2024-06-21T17:42:52Z)
- Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning that learns the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model. We achieve this by learning an underlying Bernoulli distribution to sample binary pruning masks. Experiments conducted on LLaMA, LLaMA-2, LLaMA-3, Vicuna, and Mistral models demonstrate the promising performance of our method in efficiency and effectiveness.
arXiv Detail & Related papers (2024-06-15T09:31:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.