$D^2Prune$: Sparsifying Large Language Models via Dual Taylor Expansion and Attention Distribution Awareness
- URL: http://arxiv.org/abs/2601.09176v1
- Date: Wed, 14 Jan 2026 05:17:35 GMT
- Title: $D^2Prune$: Sparsifying Large Language Models via Dual Taylor Expansion and Attention Distribution Awareness
- Authors: Lang Xiong, Ning Liu, Ao Ren, Yuheng Bai, Haining Fang, BinYan Zhang, Zhe Jiang, Yujuan Tan, Duo Liu
- Abstract summary: Large language models (LLMs) face significant deployment challenges due to their massive computational demands. This paper proposes a novel pruning method, $D^2Prune$, to address these limitations. $D^2Prune$ consistently outperforms SOTA methods across various LLMs.
- Score: 13.59262810896553
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) face significant deployment challenges due to their massive computational demands. While pruning offers a promising compression solution, existing methods suffer from two critical limitations: (1) they neglect activation distribution shifts between calibration data and test data, resulting in inaccurate error estimations; (2) they overlook the long-tail distribution characteristics of activations in the attention module. To address these limitations, this paper proposes a novel pruning method, $D^2Prune$. First, we propose a dual Taylor expansion-based method that jointly models weight and activation perturbations for precise error estimation, leading to more accurate pruning mask selection and weight updates and facilitating error minimization during pruning. Second, we propose an attention-aware dynamic update strategy that preserves the long-tail attention pattern by jointly minimizing the KL divergence of attention distributions and the reconstruction error. Extensive experiments show that $D^2Prune$ consistently outperforms SOTA methods across various LLMs (e.g., OPT-125M, LLaMA2/3, and Qwen3). Moreover, the dynamic attention update mechanism also generalizes well to ViT-based vision models like DeiT, achieving superior accuracy on ImageNet-1K.
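The attention-aware objective described in the abstract can be sketched as a joint loss: the KL divergence between the dense model's attention distribution and the pruned model's, plus the layer reconstruction error. This is a minimal illustration of that idea, not the paper's exact formulation; the function and variable names (and the weighting `lam`) are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_aware_loss(scores_dense, scores_pruned, out_dense, out_pruned, lam=1.0):
    """KL(dense attention || pruned attention) + lam * reconstruction error.

    A sketch of the joint objective the abstract describes; the exact
    weighting and normalization used by D^2Prune may differ.
    """
    p = softmax(scores_dense)   # attention distribution before pruning
    q = softmax(scores_pruned)  # attention distribution after pruning
    eps = 1e-12                 # guard against log(0)
    kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1).mean()
    recon = np.mean((out_dense - out_pruned) ** 2)
    return kl + lam * recon
```

The loss is zero when the pruned layer reproduces both the attention distribution and the output exactly, and grows with either deviation.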
Related papers
- Mixture of Distributions Matters: Dynamic Sparse Attention for Efficient Video Diffusion Transformers [13.366686736005699]
We present MOD-DiT, a sampling-free dynamic attention framework. It accurately models evolving attention patterns through a two-stage process. It overcomes the computational limitations of traditional sparse attention approaches.
arXiv Detail & Related papers (2026-01-14T16:25:39Z) - Which Layer Causes Distribution Deviation? Entropy-Guided Adaptive Pruning for Diffusion and Flow Models [77.55829017952728]
EntPruner is an entropy-guided automatic progressive pruning framework for diffusion and flow models. Experiments on DiT and SiT models demonstrate the effectiveness of EntPruner, achieving up to 2.22$\times$ inference speedup.
arXiv Detail & Related papers (2025-11-26T07:20:48Z) - D2D: Detector-to-Differentiable Critic for Improved Numeracy in Text-to-Image Generation [26.820694706602236]
Detector-to-Differentiable (D2D) is a novel framework that transforms non-differentiable detection models into differentiable critics. Our experiments on SDXL-Turbo, SD-Turbo, and Pixart-DMD demonstrate consistent and substantial improvements in object counting accuracy.
arXiv Detail & Related papers (2025-10-22T06:27:05Z) - Accelerated Aggregated D-Optimal Designs for Estimating Main Effects in Black-Box Models [3.093890460224435]
We propose A2D2E, an $\textbf{E}$stimator based on $\textbf{A}$ccelerated $\textbf{A}$ggregated $\textbf{D}$esigns. We establish theoretical guarantees, including convergence and variance reduction, and validate A2D2E through extensive simulations.
arXiv Detail & Related papers (2025-10-09T17:07:36Z) - Dual-Stage Reweighted MoE for Long-Tailed Egocentric Mistake Detection [85.0189917888094]
We propose a Dual-Stage Reweighted Mixture-of-Experts (DR-MoE) framework to handle the challenges posed by subtle and infrequent mistakes. The proposed method achieves strong performance, particularly in identifying rare and ambiguous mistake instances.
arXiv Detail & Related papers (2025-09-16T12:00:42Z) - Attention, Please! Revisiting Attentive Probing Through the Lens of Efficiency [20.320991233039965]
As fine-tuning becomes impractical at scale, probing is emerging as the preferred evaluation protocol. This motivates the need for attentive probing, an alternative that uses attention to selectively aggregate patch-level features. We present the first comprehensive study of existing methods, analyzing their design choices and benchmarking their performance.
arXiv Detail & Related papers (2025-06-11T21:10:26Z) - SPAT: Sensitivity-based Multihead-attention Pruning on Time Series Forecasting Models [8.817690876855728]
We propose a structured pruning method, SPAT ($\textbf{S}$ensitivity $\textbf{P}$runer for $\textbf{At}$tention), which selectively removes attention mechanisms and yields highly effective models. Experiments on datasets demonstrate that SPAT-pruned models achieve reductions of 2.842% in MSE, 1.996% in MAE, and 35.274% in FLOPs.
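Sensitivity-based pruning of the kind SPAT's title suggests can be illustrated generically: score each attention head by how much the loss increases when that head is masked out, then prune the least sensitive heads first. This is a standard sensitivity proxy, not SPAT's exact multihead criterion; the function names and the `loss_fn(mask)` interface are illustrative assumptions.

```python
import numpy as np

def rank_heads_by_sensitivity(loss_fn, num_heads):
    """Rank attention heads by the loss increase caused by masking each one.

    `loss_fn(mask)` evaluates the model under a 0/1 mask over heads; heads
    whose removal barely moves the loss come first (safest to prune).
    """
    full = np.ones(num_heads)
    base = loss_fn(full)  # loss with all heads active
    sensitivities = []
    for h in range(num_heads):
        mask = full.copy()
        mask[h] = 0.0  # ablate head h
        sensitivities.append(loss_fn(mask) - base)
    return np.argsort(sensitivities)  # ascending: least sensitive first
```

With a toy loss that weights heads unevenly, the ranking recovers the least important heads first.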
arXiv Detail & Related papers (2025-05-13T17:39:31Z) - Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free [81.65559031466452]
We conduct experiments to investigate gating-augmented softmax attention variants. We find that a simple modification-applying a head-specific sigmoid gate after the Scaled Dot-Product Attention (SDPA)-consistently improves performance.
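The modification described above is concrete enough to sketch: run standard SDPA, then scale its output elementwise by a sigmoid of a learned per-head gate. This is a minimal NumPy illustration under assumed shapes; `gate_logit` is a hypothetical name for the learned per-head parameter.

```python
import numpy as np

def gated_sdpa(q, k, v, gate_logit):
    """Scaled dot-product attention followed by a sigmoid gate on the output.

    q, k, v: arrays of shape (seq_len, d_head) for one head;
    gate_logit: the head's learned scalar gate parameter (assumed name).
    """
    d = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d)  # scaled dot products
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)      # softmax over keys
    out = attn @ v                                # standard SDPA output
    gate = 1.0 / (1.0 + np.exp(-gate_logit))      # sigmoid gate in (0, 1)
    return gate * out
```

A gate logit of 0 halves the output (sigmoid(0) = 0.5); a large positive logit leaves SDPA essentially unchanged, so the gate can learn to suppress or pass each head.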
arXiv Detail & Related papers (2025-05-10T17:15:49Z) - Predictor-Corrector Enhanced Transformers with Exponential Moving Average Coefficient Learning [73.73967342609603]
We introduce a predictor-corrector learning framework to minimize truncation errors.
We also propose an exponential moving average-based coefficient learning method to strengthen our higher-order predictor.
Our model surpasses a robust 3.8B DeepNet by an average of 2.9 SacreBLEU, using only 1/3 parameters.
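The exponential moving average-based coefficient learning mentioned above can be sketched generically: successive coefficient estimates are blended with decay `beta`, so later estimates enter smoothly instead of replacing earlier ones. This is a plain EMA sketch, not the paper's exact parameterization of the higher-order predictor.

```python
def ema_coefficients(estimates, beta=0.9):
    """Exponential moving average over a sequence of coefficient estimates.

    Each new estimate contributes with weight (1 - beta); older estimates
    decay geometrically. Names and the interface are illustrative.
    """
    ema = estimates[0]
    for e in estimates[1:]:
        ema = beta * ema + (1.0 - beta) * e
    return ema
```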
arXiv Detail & Related papers (2024-11-05T12:26:25Z) - Model Inversion Attacks Through Target-Specific Conditional Diffusion Models [54.69008212790426]
Model inversion attacks (MIAs) aim to reconstruct private images from a target classifier's training set, thereby raising privacy concerns in AI applications.
Previous GAN-based MIAs tend to suffer from inferior generative fidelity due to GAN's inherent flaws and biased optimization within latent space.
We propose Diffusion-based Model Inversion (Diff-MI) attacks to alleviate these issues.
arXiv Detail & Related papers (2024-07-16T06:38:49Z) - Transformers meet Stochastic Block Models: Attention with Data-Adaptive Sparsity and Cost [53.746169882193456]
Recent works have proposed various sparse attention modules to overcome the quadratic cost of self-attention.
We propose a model that resolves both problems by endowing each attention head with a mixed-membership Block Model.
Our model outperforms previous efficient variants as well as the original Transformer with full attention.
arXiv Detail & Related papers (2022-10-27T15:30:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.