POP: Online Structural Pruning Enables Efficient Inference of Large Foundation Models
- URL: http://arxiv.org/abs/2602.06822v1
- Date: Fri, 06 Feb 2026 16:07:42 GMT
- Title: POP: Online Structural Pruning Enables Efficient Inference of Large Foundation Models
- Authors: Yi Chen, Wonjin Shin, Shuhong Liu, Tho Mai, Jeongmo Lee, Chuanbo Hua, Kun Wang, Jun Liu, Joo-Young Kim
- Abstract summary: POP (Partition-guided Online Pruning) is an efficient online structural pruning framework with minimal computational overhead. POP is a lightweight, plug-and-play method that requires no preprocessing such as offline calibration, retraining, or learned predictors.
- Score: 12.10403234534641
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large foundation models (LFMs) achieve strong performance through scaling, yet current structural pruning methods apply fixed pruning decisions throughout inference, overlooking the sparsity patterns that emerge during autoregressive token generation. In this paper, we propose POP (Partition-guided Online Pruning), an efficient online structural pruning framework that enables context-conditioned dynamic pruning with minimal computational overhead. POP partitions model channels into retained, candidate, and pruned regions: the prefill stage defines a coarse pruning partition, and the decoding stage generates a fine-grained mask within the candidate region, avoiding full-channel re-evaluation. The coarse partition preserves consistently important weights, while the fine-grained mask provides context-conditioned variation during decoding. Moreover, POP is a lightweight, plug-and-play method that requires no preprocessing such as offline calibration, retraining, or learned predictors. Extensive evaluations across diverse LFMs, including large language models (LLMs), mixture-of-experts models (MoEs), and vision-language models (VLMs), demonstrate that POP consistently delivers higher accuracy than existing pruning approaches while incurring lower computational overhead and inference latency.
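The partition-then-mask loop can be made concrete with a short sketch. The following is a minimal, hypothetical PyTorch illustration for a single linear layer; the importance proxy (activation magnitude times weight magnitude), the region sizes (`keep`, `candidate`), and the top-k rule inside the candidate region are our assumptions, not details confirmed by the paper.

```python
# Hypothetical sketch of POP-style partition-guided online pruning for one
# linear layer. The importance proxy and region sizes are assumptions.
import torch

def partition_channels(x_prefill, weight, keep=0.5, candidate=0.3):
    """Coarse partition computed once from prefill activations.

    x_prefill: (tokens, in_features); weight: (out_features, in_features).
    Returns retained / candidate / pruned input-channel indices.
    """
    score = x_prefill.abs().mean(dim=0) * weight.abs().mean(dim=0)
    order = torch.argsort(score, descending=True)
    n = score.numel()
    n_keep, n_cand = int(n * keep), int(n * candidate)
    return order[:n_keep], order[n_keep:n_keep + n_cand], order[n_keep + n_cand:]

def decode_step(x_t, weight, retained, candidate, cand_keep=0.5):
    """One decoding step: re-score only the candidate region, never all channels."""
    k = int(candidate.numel() * cand_keep)
    picked = candidate[torch.topk(x_t[candidate].abs(), k).indices]
    active = torch.cat([retained, picked])          # context-conditioned mask
    return x_t[active] @ weight[:, active].T        # skip pruned channels

torch.manual_seed(0)
W = torch.randn(16, 64)
r, c, p = partition_channels(torch.randn(128, 64), W)   # once, at prefill
y = decode_step(torch.randn(64), W, r, c)               # every decode step
print(y.shape)  # torch.Size([16])
```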
Related papers
- StepVAR: Structure-Texture Guided Pruning for Visual Autoregressive Models [98.72926158261937]
We propose a training-free token pruning framework for Visual AutoRegressive models. We employ a lightweight high-pass filter to capture local texture details, while leveraging Principal Component Analysis (PCA) to preserve global structural information. To maintain valid next-scale prediction under sparse tokens, we introduce a nearest neighbor feature propagation strategy.
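As a rough illustration of how texture and structure cues might be fused into per-token keep scores, here is a hypothetical PyTorch sketch; the Laplacian kernel, the PCA rank, and the fusion weight `alpha` are our assumptions, not StepVAR's actual design.

```python
# Hypothetical token-scoring sketch in the spirit of texture + structure
# cues; kernel, rank, and fusion weight are assumptions.
import torch
import torch.nn.functional as F

def token_scores(feat, grid, pca_rank=8, alpha=0.5):
    """feat: (H*W, C) token features laid out on an H x W grid."""
    H, W = grid
    # Texture cue: high-pass (Laplacian) response of the channel-mean map.
    fmap = feat.mean(dim=1).reshape(1, 1, H, W)
    lap = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]]).view(1, 1, 3, 3)
    texture = F.conv2d(fmap, lap, padding=1).abs().reshape(-1)
    # Structure cue: token energy in the top principal components.
    centered = feat - feat.mean(dim=0)
    _, _, V = torch.pca_lowrank(centered, q=pca_rank)
    structure = (centered @ V).pow(2).sum(dim=1)
    norm = lambda s: (s - s.min()) / (s.max() - s.min() + 1e-8)
    return alpha * norm(texture) + (1 - alpha) * norm(structure)

scores = token_scores(torch.randn(64, 32), grid=(8, 8))
keep = torch.topk(scores, k=32).indices   # retain the top half of the tokens
```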
arXiv Detail & Related papers (2026-03-02T11:35:05Z)
- POP: Prefill-Only Pruning for Efficient Large Model Inference [5.743318651374061]
Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated remarkable capabilities. Existing structured pruning methods, while hardware-efficient, often suffer from significant accuracy degradation. We argue that this failure stems from a stage-agnostic pruning approach that overlooks the asymmetric roles of the prefill and decode stages.
arXiv Detail & Related papers (2026-02-03T09:22:26Z)
- Which Layer Causes Distribution Deviation? Entropy-Guided Adaptive Pruning for Diffusion and Flow Models [77.55829017952728]
EntPruner is an entropy-guided automatic progressive pruning framework for diffusion and flow models. Experiments on DiT and SiT models demonstrate the effectiveness of EntPruner, achieving up to 2.22× inference speedup.
arXiv Detail & Related papers (2025-11-26T07:20:48Z)
- Towards Efficient General Feature Prediction in Masked Skeleton Modeling [59.46799426434277]
We propose a novel General Feature Prediction framework (GFP) for efficient masked skeleton modeling. Our key innovation is replacing conventional low-level reconstruction with high-level feature prediction that spans from local motion patterns to global semantic representations.
arXiv Detail & Related papers (2025-09-03T18:05:02Z)
- FLAT-LLM: Fine-grained Low-rank Activation Space Transformation for Large Language Model Compression [15.784158079414235]
FLAT-LLM is a training-free structural compression method based on fine-grained low-rank transformations in the activation space. It achieves efficient and effective weight compression without recovery fine-tuning and completes calibration within a few minutes.
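One way to picture a low-rank transformation in activation space is to project weights onto the top directions of calibration activations. The sketch below is a hypothetical PyTorch illustration of that generic idea; the plain PCA basis and the fixed rank are our assumptions rather than FLAT-LLM's actual transformation.

```python
# Hypothetical activation-space low-rank compression sketch; the plain PCA
# basis and fixed rank are assumptions, not FLAT-LLM's exact procedure.
import torch

def low_rank_compress(weight, x_calib, rank):
    """Factor W ~= (W V_r) V_r^T with V_r spanning top activation directions.

    weight: (out_features, in_features); x_calib: (tokens, in_features).
    """
    centered = x_calib - x_calib.mean(dim=0)
    _, _, V = torch.pca_lowrank(centered, q=rank)  # V: (in_features, rank)
    return weight @ V, V.T  # forward pass: x @ B.T @ A.T replaces x @ W.T

W, X = torch.randn(256, 512), torch.randn(1024, 512)
A, B = low_rank_compress(W, X, rank=64)
print(f"relative error: {(W - A @ B).norm() / W.norm():.3f}")
```

On real calibration data the activations are far from isotropic, so far fewer directions carry most of the energy than in this random-data toy.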
arXiv Detail & Related papers (2025-05-29T19:42:35Z)
- Model Hemorrhage and the Robustness Limits of Large Language Models [119.46442117681147]
Large language models (LLMs) demonstrate strong performance across natural language processing tasks, yet undergo significant performance degradation when modified for deployment. We define this phenomenon as model hemorrhage: performance decline caused by parameter alterations and architectural changes.
arXiv Detail & Related papers (2025-03-31T10:16:03Z)
- Sample-aware Adaptive Structured Pruning for Large Language Models [14.605017410864583]
This study introduces AdaPruner, a sample-aware adaptive structured pruning framework for large language models (LLMs). Specifically, AdaPruner effectively removes redundant parameters from LLMs by constructing a structured pruning solution space. At a 20% pruning ratio, the model pruned with AdaPruner maintains 97% of the performance of the unpruned model.
arXiv Detail & Related papers (2025-03-08T12:00:21Z)
- PIP: Perturbation-based Iterative Pruning for Large Language Models [15.00536465178398]
We propose PIP (Perturbation-based Iterative Pruning), a novel double-view structured pruning method for optimizing Large Language Models. By calculating gradient differences, PIP iteratively prunes the weights that struggle to distinguish between the two views. Our experiments show that PIP reduces the parameter count by approximately 20% while retaining over 85% of the original model's accuracy.
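The summary leaves the two views abstract; one generic reading is to compare gradients under a clean and a perturbed input and treat weights whose gradients barely differ as prunable. The PyTorch sketch below illustrates that reading; the stand-in objective, the noise scale, and the per-row granularity are all our assumptions, not PIP's exact design.

```python
# Hypothetical gradient-difference scoring sketch; the objective, noise
# scale, and row granularity are assumptions.
import torch

def grad_difference_scores(layer, x, noise=1e-2):
    """Score each output row by how much its gradient changes across views."""
    def grads(inp):
        layer.zero_grad()
        layer(inp).pow(2).mean().backward()   # stand-in objective
        return layer.weight.grad.clone()
    g_clean = grads(x)                               # view 1: clean input
    g_pert = grads(x + noise * torch.randn_like(x))  # view 2: perturbed input
    # Rows whose gradients barely distinguish the views are prune candidates.
    return (g_clean - g_pert).abs().sum(dim=1)

layer = torch.nn.Linear(64, 32)
scores = grad_difference_scores(layer, torch.randn(8, 64))
print(torch.topk(scores, k=8, largest=False).indices)  # rows to prune first
```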
arXiv Detail & Related papers (2025-01-25T17:10:50Z)
- Instruction-Following Pruning for Large Language Models [58.329978053711024]
We move beyond the traditional static pruning approach of determining a fixed pruning mask for a model. In our method, the pruning mask is input-dependent and adapts dynamically based on the information described in a user instruction. Our approach, termed "instruction-following pruning", introduces a sparse mask predictor that takes the user instruction as input and dynamically selects the most relevant model parameters for the given task.
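The sparse mask predictor can be pictured as a small network mapping an instruction embedding to a per-channel keep/drop decision. The PyTorch sketch below is a hypothetical illustration; the predictor architecture, the top-k hard mask, and the straight-through relaxation are our assumptions, not the paper's design.

```python
# Hypothetical instruction-conditioned mask predictor sketch; architecture,
# top-k rule, and straight-through trick are assumptions.
import torch
import torch.nn as nn

class MaskPredictor(nn.Module):
    def __init__(self, instr_dim, n_channels, keep_ratio=0.5):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(instr_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_channels))
        self.k = int(n_channels * keep_ratio)

    def forward(self, instr_emb):
        logits = self.net(instr_emb)
        probs = logits.sigmoid()
        mask = torch.zeros_like(logits)
        mask[torch.topk(logits, self.k).indices] = 1.0  # hard top-k keep mask
        # Straight-through estimator keeps the predictor trainable.
        return mask + probs - probs.detach()

predictor = MaskPredictor(instr_dim=32, n_channels=256)
mask = predictor(torch.randn(32))
print(int(mask.sum().item()))  # 128 channels kept for this instruction
```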
arXiv Detail & Related papers (2025-01-03T20:19:14Z)
- QPruner: Probabilistic Decision Quantization for Structured Pruning in Large Language Models [3.093903491123962]
Large language models (LLMs) have significantly advanced various natural language processing (NLP) tasks. Structured pruning is an effective approach to reducing model size, but it often results in significant accuracy degradation. We introduce quantization into the structured pruning framework to reduce memory consumption during both fine-tuning and inference. We propose QPruner, a novel framework that employs structured pruning to reduce model size, followed by a layer-wise mixed-precision quantization scheme.
arXiv Detail & Related papers (2024-12-16T10:14:01Z)
- DRPruning: Efficient Large Language Model Pruning through Distributionally Robust Optimization [59.96455188197593]
Large language models (LLMs) deliver impressive results but face challenges from increasing model sizes and computational costs. We propose DRPruning, a method that dynamically adjusts the data distribution during training to restore balanced performance across heterogeneous and multi-tasking data. Experiments in monolingual and multilingual settings show that DRPruning surpasses similarly sized models in both pruning and continued pretraining over perplexity, downstream tasks, and instruction tuning.
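Dynamically adjusting the data distribution is in the spirit of distributionally robust optimization, where domains with high loss get upweighted. Below is a hypothetical, generic exponentiated-gradient sketch of that idea in PyTorch; the step size and the toy losses are assumptions, not DRPruning's recipe.

```python
# Hypothetical DRO-style domain re-weighting sketch; the exponentiated-
# gradient update and toy losses are assumptions.
import torch

def update_domain_weights(weights, domain_losses, eta=0.1):
    """Upweight domains with high loss so later training rebalances them."""
    new_w = weights * torch.exp(eta * domain_losses)
    return new_w / new_w.sum()

w = torch.full((4,), 0.25)                        # uniform over 4 domains
losses = torch.tensor([2.0, 1.0, 0.5, 0.25])      # per-domain eval losses
for _ in range(3):
    w = update_domain_weights(w, losses)
print(w)  # probability mass drifts toward the hardest domain
```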
arXiv Detail & Related papers (2024-11-21T12:02:39Z)
- Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning method that learns pruning masks in a probabilistic space directly by optimizing the loss of the pruned model. We achieve this by learning an underlying Bernoulli distribution from which binary pruning masks are sampled. Experiments conducted on LLaMA, LLaMA-2, LLaMA-3, Vicuna, and Mistral models demonstrate the promising performance of our method in efficiency and effectiveness.
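The core trick, learning Bernoulli parameters for binary masks with a score-function (policy) gradient so no back-propagation through the pruned model is needed, fits in a few lines. The PyTorch sketch below uses a toy objective in place of an LLM loss; the objective, the learning rate, and the absence of a variance-reduction baseline are our simplifications.

```python
# Hypothetical policy-gradient mask-learning sketch; the toy objective and
# missing variance-reduction baseline are simplifications.
import torch

torch.manual_seed(0)
n = 64
logits = torch.zeros(n, requires_grad=True)      # Bernoulli mask parameters
opt = torch.optim.Adam([logits], lr=0.1)
target = (torch.randn(n) > 0).float()            # toy "ideal" mask

for _ in range(200):
    probs = torch.sigmoid(logits)
    mask = torch.bernoulli(probs)                # sample a binary mask
    loss = ((mask - target) ** 2).mean()         # loss of the "pruned model"
    log_prob = (mask * probs.clamp_min(1e-8).log()
                + (1 - mask) * (1 - probs).clamp_min(1e-8).log()).sum()
    opt.zero_grad()
    (loss.detach() * log_prob).backward()        # REINFORCE: no gradient
    opt.step()                                   # through the sampled mask

learned = (torch.sigmoid(logits) > 0.5).float()
print(f"mask disagreement: {(learned - target).abs().mean():.2f}")
```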
arXiv Detail & Related papers (2024-06-15T09:31:03Z)
- MLPruning: A Multilevel Structured Pruning Framework for Transformer-based Models [78.45898846056303]
Pruning is an effective method to reduce the memory footprint and computational cost associated with large natural language processing models.
We develop a novel MultiLevel structured Pruning framework, which uses three different levels of structured pruning: head pruning, row pruning, and block-wise sparse pruning.
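As a feel for the coarsest of the three levels, here is a hypothetical PyTorch sketch of head-level importance scoring; the proxy (mean absolute head output) is our assumption, not the framework's actual criterion.

```python
# Hypothetical head-pruning sketch for the coarsest level; the mean-|output|
# importance proxy is an assumption.
import torch

def head_importance(attn_out, n_heads):
    """attn_out: (tokens, n_heads * head_dim) concatenated head outputs."""
    tokens, dim = attn_out.shape
    per_head = attn_out.reshape(tokens, n_heads, dim // n_heads)
    return per_head.abs().mean(dim=(0, 2))            # one score per head

scores = head_importance(torch.randn(128, 12 * 64), n_heads=12)
print(torch.topk(scores, k=4, largest=False).indices)  # heads to drop first
```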
arXiv Detail & Related papers (2021-05-30T22:00:44Z)