Related papers: UniPruning: Unifying Local Metric and Global Feedback for Scalable Sparse LLMs

UniPruning: Unifying Local Metric and Global Feedback for Scalable Sparse LLMs

URL: http://arxiv.org/abs/2510.03291v1
Date: Mon, 29 Sep 2025 13:38:28 GMT
Title: UniPruning: Unifying Local Metric and Global Feedback for Scalable Sparse LLMs
Authors: Yizhuo Ding, Wanying Qu, Jiawei Geng, Wenqi Shao, Yanwei Fu,
Abstract summary: Large Language Models (LLMs) achieve strong performance across diverse tasks but face prohibitive computational and memory costs.<n>We present UniPruning, a unified post-training pruning framework that combines the speed of local saliency metrics with the stability of global coordination.<n>UniPruning consistently delivers competitive or superior perplexity and zero-shot accuracy.
Score: 46.12497343562301
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) achieve strong performance across diverse tasks but face prohibitive computational and memory costs. Pruning offers a promising path by inducing sparsity while preserving architectural flexibility. However, existing methods struggle to balance efficiency and robustness: local metric approaches prune layer by layer but often collapse under high sparsity, whereas global feedback methods enforce consistency at the cost of expensive weight updates or restrictive semi-structured formats. We present UniPruning, a unified post-training pruning framework that combines the speed of local saliency metrics with the stability of global coordination, enabled by a mirror descent based optimization, all without updating model weights. UniPruning leverages fast layer-wise scoring and a lightweight global controller to allocate a single sparsity budget, supporting both unstructured and semi-structured N :M pruning within one framework. After a brief calibration, it can generate pruning masks for arbitrary sparsity levels in one shot, and adapts seamlessly to hardware-aware constraints. Extensive experiments on multiple pretrained LLM families and standard benchmarks show that UniPruning consistently delivers competitive or superior perplexity and zero-shot accuracy. Ablation studies further highlight the importance of mirror descent and local saliency anchoring. Overall, UniPruning provides an efficient, principled, and scalable solution for sparsifying large-scale LLMs. Our code is available at: https://github.com/RainbowQTT/UniPruning.

Related papers

StructPrune: Structured Global Pruning asymptotics with $\mathcal{O}(\sqrt{N})$ GPU Memory [11.996799691784693]
Pruning is critical for scaling large language models (LLMs)<n>Global pruning achieves strong performance but requires $mathcalO(N)$ memory.<n>Local pruning reduces GPU memory usage to that of a single layer by pruning layers independently.
arXiv Detail & Related papers (2025-09-25T19:16:50Z)
MaskPro: Linear-Space Probabilistic Learning for Strict (N:M)-Sparsity on Large Language Models [53.36415620647177]
Semi-structured sparsity offers a promising solution by strategically retaining $N$ elements out of every $M$ weights.<n>Existing (N:M)-compatible approaches typically fall into two categories: rule-based layerwise greedy search, which suffers from considerable errors, and gradient-driven learning, which incurs prohibitive training costs.<n>We propose a novel linear-space probabilistic framework named MaskPro, which aims to learn a prior categorical distribution for every $M$ consecutive weights and subsequently leverages this distribution to generate the (N:M)-sparsity throughout an $N$-way sampling
arXiv Detail & Related papers (2025-06-15T15:02:59Z)
Stochastic Layer-wise Learning: Scalable and Efficient Alternative to Backpropagation [1.0285749562751982]
Backpropagation underpins modern deep learning, yet its reliance on global synchronization limits scalability and incurs high memory costs.<n>In contrast, fully local learning rules are more efficient but often struggle to maintain the cross-layer coordination needed for coherent global learning.<n>We introduce Layer-wise Learning (SLL), a layer-wise training algorithm that decomposes the global objective into coordinated layer-local updates.
arXiv Detail & Related papers (2025-05-08T12:32:29Z)
Týr-the-Pruner: Unlocking Accurate 50% Structural Pruning for LLMs via Global Sparsity Distribution Optimization [15.027017826182659]
T'yr-the-Pruner is an efficient end-to-end search-based global structural pruning framework.<n>We introduce an effective local pruning and an expectation error accumulation approach to improve supernet construction.<n>Results show that T'yr-the-Pruner achieves state-of-the-art structural pruning, retaining 97% of the dense model's performance while removing a challenging 50% of Llama-3.1-70B's parameters.
arXiv Detail & Related papers (2025-03-12T11:52:49Z)
Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning that learns the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model.<n>We achieve this by learning an underlying Bernoulli distribution to sample binary pruning masks.<n>Experiments conducted on LLaMA, LLaMA-2, LLaMA-3, Vicuna, and Mistral models demonstrate the promising performance of our method in efficiency and effectiveness.
arXiv Detail & Related papers (2024-06-15T09:31:03Z)
Stragglers-Aware Low-Latency Synchronous Federated Learning via Layer-Wise Model Updates [71.81037644563217]
Synchronous federated learning (FL) is a popular paradigm for collaborative edge learning. As some of the devices may have limited computational resources and varying availability, FL latency is highly sensitive to stragglers. We propose straggler-aware layer-wise federated learning (SALF) that leverages the optimization procedure of NNs via backpropagation to update the global model in a layer-wise fashion.
arXiv Detail & Related papers (2024-03-27T09:14:36Z)
SparseLLM: Towards Global Pruning for Pre-trained Language Models [12.057369029549534]
We propose SparseLLM, a novel framework that redefines the global pruning process into manageable, coordinated subproblems. SparseLLM's approach conceptualizes LLMs as a chain of modular functions and leverages auxiliary variables for problem decomposition. It demonstrates significant performance improvements, particularly in high-sparsity regimes.
arXiv Detail & Related papers (2024-02-28T00:09:07Z)
Fluctuation-based Adaptive Structured Pruning for Large Language Models [44.217363567065]
FLAP (FLuctuation-based Adaptive Structured Pruning) is a retraining-free structured pruning framework for Large Language Models. It is hardware-friendly by effectively reducing storage and enhancing inference speed.
arXiv Detail & Related papers (2023-12-19T09:23:48Z)
Dynamic Regularized Sharpness Aware Minimization in Federated Learning: Approaching Global Consistency and Smooth Landscape [59.841889495864386]
In federated learning (FL), a cluster of local clients are chaired under the coordination of a global server. Clients are prone to overfit into their own optima, which extremely deviates from the global objective. ttfamily FedSMOO adopts a dynamic regularizer to guarantee the local optima towards the global objective. Our theoretical analysis indicates that ttfamily FedSMOO achieves fast $mathcalO (1/T)$ convergence rate with low bound generalization.
arXiv Detail & Related papers (2023-05-19T10:47:44Z)
Layer-adaptive sparsity for the Magnitude-based Pruning [88.37510230946478]
We propose a novel importance score for global pruning, coined layer-adaptive magnitude-based pruning (LAMP) score. LAMP consistently outperforms popular existing schemes for layerwise sparsity selection.
arXiv Detail & Related papers (2020-10-15T09:14:02Z)

This list is automatically generated from the titles and abstracts of the papers in this site.