Z-Pruner: Post-Training Pruning of Large Language Models for Efficiency without Retraining
- URL: http://arxiv.org/abs/2508.15828v1
- Date: Mon, 18 Aug 2025 16:19:22 GMT
- Title: Z-Pruner: Post-Training Pruning of Large Language Models for Efficiency without Retraining
- Authors: Samiul Basir Bhuiyan, Md. Sazzad Hossain Adib, Mohammed Aman Bhuiyan, Muhammad Rafsan Kabir, Moshiur Farazi, Shafin Rahman, Nabeel Mohammed
- Abstract summary: Post-training pruning is a promising approach for reducing model size and inference latency without the need for retraining. We introduce Z-Pruner, a novel post-training pruning method designed to induce sparsity in pretrained large language models without retraining. Z-Pruner surpasses state-of-the-art pruning methods that require intensive weight updates.
- Score: 6.578456055730258
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have rapidly advanced in recent years, achieving remarkable performance across a wide range of natural language processing tasks. However, this progress has come at the cost of increasingly large model sizes, which pose significant challenges for deployment, scalability, and energy efficiency. To address these limitations, post-training pruning has emerged as a promising approach for reducing model size and inference latency without the need for retraining. Despite these advantages, many existing pruning methods result in substantial performance degradation or require computationally expensive fine-tuning. In this work, we introduce Z-Pruner, a novel post-training pruning method designed to induce sparsity in pretrained LLMs without any retraining. Unlike conventional approaches, Z-Pruner leverages both weight update magnitudes and activation patterns to identify and eliminate redundant parameters more effectively. Our method is model-agnostic, efficient, and easy to implement. We evaluate Z-Pruner using multiple widely-used LLM architectures, including LLaMA-2, LLaMA-3, and OPT, across a diverse set of standard language benchmarks. Experimental results demonstrate that Z-Pruner surpasses state-of-the-art pruning methods that require intensive weight updates. Specifically, Z-Pruner achieves the lowest perplexity scores and the highest overall average score for zero-shot accuracy. We have made the corresponding codes publicly available at https://github.com/sazzadadib/Z-Pruner.
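The abstract states only that Z-Pruner combines weight update magnitudes with activation patterns; its exact scoring rule is not reproduced here. As a point of reference, the sketch below shows a generic activation-aware magnitude criterion applied layer-wise under an assumed fixed sparsity ratio. The function and tensor names are illustrative and not taken from the Z-Pruner repository.

```python
import torch

def activation_aware_prune(weight: torch.Tensor,
                           act_norm: torch.Tensor,
                           sparsity: float = 0.5) -> torch.Tensor:
    """Zero out the lowest-scoring weights of one linear layer.

    weight   : (out_features, in_features) weight matrix
    act_norm : (in_features,) per-input-channel activation norm collected
               from a small calibration set (a common post-training proxy)
    sparsity : fraction of weights to remove in this layer
    """
    # Score each weight by its magnitude scaled by the activation strength
    # of the input feature it multiplies (a Wanda-style |W| * ||X|| signal).
    score = weight.abs() * act_norm.unsqueeze(0)

    # Drop the k lowest-scoring weights in every output row.
    k = int(weight.shape[1] * sparsity)
    _, drop_idx = torch.topk(score, k, dim=1, largest=False)
    mask = torch.ones_like(weight)
    mask.scatter_(1, drop_idx, 0.0)
    return weight * mask
```

In a post-training setting like the one the abstract describes, `act_norm` would be estimated from a few calibration batches and the masked weights written back in place, so no gradients or retraining are required.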
Related papers
- Gradually Compacting Large Language Models for Reasoning Like a Boiling Frog [72.4168434368873]
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, but their substantial size often demands significant computational resources. We propose a gradual compacting method that divides the compression process into multiple fine-grained iterations. This iterative approach, reminiscent of the "boiling frog" effect, enables the model to be progressively compressed without abrupt performance loss.
arXiv Detail & Related papers (2026-02-04T06:56:52Z) - Language Ranker: A Lightweight Ranking framework for LLM Decoding [70.01564145836129]
This paper conceptualizes the decoding process as analogous to the ranking stage in recommendation pipelines. Motivated by this insight, we propose Language Ranker, a novel framework that introduces a lightweight module to rerank candidate responses. Experiments show that Language Ranker achieves performance comparable to large-scale reward models, while requiring only 0.5M additional parameters.
arXiv Detail & Related papers (2025-10-23T17:56:46Z) - RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models [53.571195477043496]
We propose an algorithm named Rotated Straight-Through-Estimator (RoSTE). RoSTE combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy to reduce activation outliers. Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration.
arXiv Detail & Related papers (2025-02-13T06:44:33Z) - Adapt-Pruner: Adaptive Structural Pruning for Efficient Small Language Model Training [27.857935426067076]
Small language models (SLMs) have attracted considerable attention due to their broad range of applications in edge devices. To obtain SLMs with strong performance, conventional approaches either pre-train the models from scratch, which incurs substantial computational costs, or compress/prune existing large language models (LLMs), which results in performance drops and falls short of pre-training. We find that (1) layer-wise adaptive pruning (Adapt-Pruner) is extremely effective for LLMs and yields significant improvements over existing pruning techniques, and (2) adaptive pruning combined with further training produces models comparable to those pre-trained from scratch.
arXiv Detail & Related papers (2025-02-05T18:57:40Z) - Progressive Binarization with Semi-Structured Pruning for LLMs [36.32239429974179]
We propose Progressive Binarization with Semi-Structured Pruning (PBS$^2$P), a novel post-training compression framework. We show that PBS$^2$P consistently outperforms state-of-the-art binary post-training quantization methods in both perplexity and downstream accuracy.
arXiv Detail & Related papers (2025-02-03T13:30:29Z) - Pruning Foundation Models for High Accuracy without Retraining [48.256389781305415]
It is challenging to deploy foundation models or large language models (LLMs) due to their massive parameters and computations.
Post-training pruning methods are proposed to prune LLMs in one shot without retraining.
Our experiments demonstrate the superior performance of the proposed methods in comparison to SOTA baselines.
arXiv Detail & Related papers (2024-10-21T01:23:34Z) - Pruning Large Language Models with Semi-Structural Adaptive Sparse Training [17.381160429641316]
Adaptive Sparse Trainer (AST) is a novel and efficient retraining framework tailored for semi-structured sparse models. AST reduces the perplexity and zero-shot accuracy gaps between dense and 2:4 semi-structured sparse models to 0.6 and 1.16%, respectively (a minimal sketch of the 2:4 pattern itself appears after this list).
arXiv Detail & Related papers (2024-07-30T06:33:44Z) - Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning method that learns the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model. We achieve this by learning an underlying Bernoulli distribution to sample binary pruning masks. Experiments conducted on LLaMA, LLaMA-2, LLaMA-3, Vicuna, and Mistral models demonstrate the promising performance of our method in efficiency and effectiveness.
arXiv Detail & Related papers (2024-06-15T09:31:03Z) - SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models [53.638791265113625]
SPP is a sparsity-preserved, parameter-efficient fine-tuning method for large language models.
Code will be made available at https://github.com/Lucky-Lance/SPP.
arXiv Detail & Related papers (2024-05-25T04:55:27Z) - Fast and Effective Weight Update for Pruned Large Language Models [0.0]
Pruning large language models (LLMs) is a challenging task due to their enormous size.
Recent approaches have either ignored fine-tuning entirely or attempted layer-wise weight updates.
We propose a fast and effective weight update algorithm for pruned layers based on the Alternating Direction Method of Multipliers.
arXiv Detail & Related papers (2024-01-01T23:10:23Z) - Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models [30.246821533532017]
Large Language Models (LLMs) with billions of parameters are prime targets for network pruning, removing some model weights without hurting performance.
We present a novel sparsity-centric pruning method for pretrained LLMs, termed Gradient-based Language Model Pruner (GBLM-Pruner).
arXiv Detail & Related papers (2023-11-08T18:59:54Z) - A Simple and Effective Pruning Approach for Large Language Models [58.716255689941896]
Large Language Models (LLMs) are natural candidates for network pruning methods.
Existing methods, however, require either retraining, or solving a weight reconstruction problem reliant on second-order information.
We introduce a novel, straightforward yet effective pruning method, termed Wanda (Pruning by Weights and activations), designed to induce sparsity in pretrained LLMs.
arXiv Detail & Related papers (2023-06-20T17:18:20Z)
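Several entries above center on semi-structured sparsity; the AST summary, in particular, quotes results for the 2:4 pattern, in which exactly two of every four consecutive weights along the input dimension are zero so that the layer can run on sparse tensor-core hardware. The sketch below shows, for reference only, how such a mask can be imposed from per-weight importance scores; it is an illustrative example, not code from any of the listed papers.

```python
import torch

def apply_2_to_4_mask(weight: torch.Tensor, score: torch.Tensor) -> torch.Tensor:
    """Keep the 2 highest-scoring weights in every group of 4 along the input dim.

    weight, score : (out_features, in_features), with in_features divisible by 4
    """
    out_f, in_f = weight.shape
    assert in_f % 4 == 0, "2:4 sparsity needs in_features to be a multiple of 4"

    # View the scores as groups of 4 consecutive input weights.
    groups = score.reshape(out_f, in_f // 4, 4)

    # Keep the 2 largest scores in each group, zero the other 2.
    keep = torch.topk(groups, k=2, dim=-1).indices
    mask = torch.zeros_like(groups)
    mask.scatter_(-1, keep, 1.0)
    return weight * mask.reshape(out_f, in_f)
```

The `score` tensor could be plain weight magnitudes or an activation-weighted variant like the sketch given after the abstract above; the 50% sparsity level is fixed by the 2:4 pattern itself.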
This list is automatically generated from the titles and abstracts of the papers on this site.