UniPruning: Unifying Local Metric and Global Feedback for Scalable Sparse LLMs
- URL: http://arxiv.org/abs/2510.03291v1
- Date: Mon, 29 Sep 2025 13:38:28 GMT
- Title: UniPruning: Unifying Local Metric and Global Feedback for Scalable Sparse LLMs
- Authors: Yizhuo Ding, Wanying Qu, Jiawei Geng, Wenqi Shao, Yanwei Fu
- Abstract summary: Large Language Models (LLMs) achieve strong performance across diverse tasks but face prohibitive computational and memory costs. We present UniPruning, a unified post-training pruning framework that combines the speed of local saliency metrics with the stability of global coordination. UniPruning consistently delivers competitive or superior perplexity and zero-shot accuracy.
- Score: 46.12497343562301
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) achieve strong performance across diverse tasks but face prohibitive computational and memory costs. Pruning offers a promising path by inducing sparsity while preserving architectural flexibility. However, existing methods struggle to balance efficiency and robustness: local metric approaches prune layer by layer but often collapse under high sparsity, whereas global feedback methods enforce consistency at the cost of expensive weight updates or restrictive semi-structured formats. We present UniPruning, a unified post-training pruning framework that combines the speed of local saliency metrics with the stability of global coordination, enabled by a mirror-descent-based optimization, all without updating model weights. UniPruning leverages fast layer-wise scoring and a lightweight global controller to allocate a single sparsity budget, supporting both unstructured and semi-structured N:M pruning within one framework. After a brief calibration, it can generate pruning masks for arbitrary sparsity levels in one shot, and it adapts seamlessly to hardware-aware constraints. Extensive experiments on multiple pretrained LLM families and standard benchmarks show that UniPruning consistently delivers competitive or superior perplexity and zero-shot accuracy. Ablation studies further highlight the importance of mirror descent and local saliency anchoring. Overall, UniPruning provides an efficient, principled, and scalable solution for sparsifying large-scale LLMs. Our code is available at: https://github.com/RainbowQTT/UniPruning.
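To make the two mask formats named in the abstract concrete, here is a minimal NumPy sketch. It is not the UniPruning algorithm: the mirror-descent optimization and the global controller are not reproduced, and weight magnitude is used as a stand-in saliency score purely for illustration. The function names `global_unstructured_mask` and `n_m_mask` are hypothetical and do not come from the paper or its repository.

```python
import numpy as np

def global_unstructured_mask(scores_per_layer, sparsity):
    """Spend ONE sparsity budget across all layers (unstructured pruning).

    scores_per_layer: list of arrays of per-weight saliency scores.
    sparsity: fraction of all weights to prune, in [0, 1).
    Assumes scores have no ties at the threshold; ties would all be pruned.
    """
    flat = np.concatenate([s.ravel() for s in scores_per_layer])
    k = int(round(sparsity * flat.size))  # number of weights to prune
    if k == 0:
        return [np.ones(s.shape, dtype=bool) for s in scores_per_layer]
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest score
    return [s > threshold for s in scores_per_layer]

def n_m_mask(scores, n, m):
    """Semi-structured N:M mask: keep the n highest scores in every
    group of m consecutive weights along the last axis of a 2-D array."""
    rows, cols = scores.shape
    assert cols % m == 0, "column count must be divisible by m"
    groups = scores.reshape(rows, cols // m, m)
    top = np.argsort(groups, axis=-1)[..., -n:]  # indices of the n largest
    mask = np.zeros(groups.shape, dtype=bool)
    np.put_along_axis(mask, top, True, axis=-1)
    return mask.reshape(rows, cols)

# Stand-in saliency: absolute weight value (an assumption for this sketch).
rng = np.random.default_rng(0)
layers = [np.abs(rng.standard_normal((4, 8))),
          np.abs(rng.standard_normal((2, 8)))]
masks = global_unstructured_mask(layers, sparsity=0.5)
mask_2_4 = n_m_mask(layers[0], n=2, m=4)
```

With a shared threshold, the per-layer sparsity is free to differ from layer to layer while the total budget is met, whereas the N:M mask fixes the density of every group of `m` weights. Those are the two regimes the abstract says are handled within one framework.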
Related papers
- StructPrune: Structured Global Pruning asymptotics with $\mathcal{O}(\sqrt{N})$ GPU Memory [11.996799691784693]
Pruning is critical for scaling large language models (LLMs). Global pruning achieves strong performance but requires $\mathcal{O}(N)$ memory. Local pruning reduces GPU memory usage to that of a single layer by pruning layers independently.
arXiv Detail & Related papers (2025-09-25T19:16:50Z)
- MaskPro: Linear-Space Probabilistic Learning for Strict (N:M)-Sparsity on Large Language Models [53.36415620647177]
Semi-structured sparsity offers a promising solution by strategically retaining $N$ elements out of every $M$ weights. Existing (N:M)-compatible approaches typically fall into two categories: rule-based layerwise greedy search, which suffers from considerable errors, and gradient-driven learning, which incurs prohibitive training costs. We propose a novel linear-space probabilistic framework named MaskPro, which aims to learn a prior categorical distribution for every $M$ consecutive weights and subsequently leverages this distribution to generate the (N:M)-sparsity throughout an $N$-way sampling
arXiv Detail & Related papers (2025-06-15T15:02:59Z)
- Stochastic Layer-wise Learning: Scalable and Efficient Alternative to Backpropagation [1.0285749562751982]
Backpropagation underpins modern deep learning, yet its reliance on global synchronization limits scalability and incurs high memory costs. In contrast, fully local learning rules are more efficient but often struggle to maintain the cross-layer coordination needed for coherent global learning. We introduce Stochastic Layer-wise Learning (SLL), a layer-wise training algorithm that decomposes the global objective into coordinated layer-local updates.
arXiv Detail & Related papers (2025-05-08T12:32:29Z)
- Týr-the-Pruner: Unlocking Accurate 50% Structural Pruning for LLMs via Global Sparsity Distribution Optimization [15.027017826182659]
Týr-the-Pruner is an efficient end-to-end search-based global structural pruning framework. We introduce an effective local pruning and an expectation error accumulation approach to improve supernet construction. Results show that Týr-the-Pruner achieves state-of-the-art structural pruning, retaining 97% of the dense model's performance while removing a challenging 50% of Llama-3.1-70B's parameters.
arXiv Detail & Related papers (2025-03-12T11:52:49Z)
- Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning that learns the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model. We achieve this by learning an underlying Bernoulli distribution to sample binary pruning masks. Experiments conducted on LLaMA, LLaMA-2, LLaMA-3, Vicuna, and Mistral models demonstrate the promising performance of our method in efficiency and effectiveness.
arXiv Detail & Related papers (2024-06-15T09:31:03Z)
- Stragglers-Aware Low-Latency Synchronous Federated Learning via Layer-Wise Model Updates [71.81037644563217]
Synchronous federated learning (FL) is a popular paradigm for collaborative edge learning.
As some of the devices may have limited computational resources and varying availability, FL latency is highly sensitive to stragglers.
We propose straggler-aware layer-wise federated learning (SALF) that leverages the optimization procedure of NNs via backpropagation to update the global model in a layer-wise fashion.
arXiv Detail & Related papers (2024-03-27T09:14:36Z)
- SparseLLM: Towards Global Pruning for Pre-trained Language Models [12.057369029549534]
We propose SparseLLM, a novel framework that redefines the global pruning process into manageable, coordinated subproblems.
SparseLLM's approach conceptualizes LLMs as a chain of modular functions and leverages auxiliary variables for problem decomposition.
It demonstrates significant performance improvements, particularly in high-sparsity regimes.
arXiv Detail & Related papers (2024-02-28T00:09:07Z)
- Fluctuation-based Adaptive Structured Pruning for Large Language Models [44.217363567065]
FLAP (FLuctuation-based Adaptive Structured Pruning) is a retraining-free structured pruning framework for Large Language Models.
It is hardware-friendly by effectively reducing storage and enhancing inference speed.
arXiv Detail & Related papers (2023-12-19T09:23:48Z)
- Dynamic Regularized Sharpness Aware Minimization in Federated Learning: Approaching Global Consistency and Smooth Landscape [59.841889495864386]
In federated learning (FL), a cluster of local clients are chaired under the coordination of a global server.
Clients are prone to overfit to their own optima, which deviate substantially from the global objective.
FedSMOO adopts a dynamic regularizer to guide the local optima towards the global objective.
Our theoretical analysis indicates that FedSMOO achieves a fast $\mathcal{O}(1/T)$ convergence rate with a low generalization bound.
arXiv Detail & Related papers (2023-05-19T10:47:44Z)
- Layer-adaptive sparsity for the Magnitude-based Pruning [88.37510230946478]
We propose a novel importance score for global pruning, coined layer-adaptive magnitude-based pruning (LAMP) score.
LAMP consistently outperforms popular existing schemes for layerwise sparsity selection.
arXiv Detail & Related papers (2020-10-15T09:14:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.