$D^2Prune$: Sparsifying Large Language Models via Dual Taylor Expansion and Attention Distribution Awareness
- URL: http://arxiv.org/abs/2601.09176v1
- Date: Wed, 14 Jan 2026 05:17:35 GMT
- Title: $D^2Prune$: Sparsifying Large Language Models via Dual Taylor Expansion and Attention Distribution Awareness
- Authors: Lang Xiong, Ning Liu, Ao Ren, Yuheng Bai, Haining Fang, BinYan Zhang, Zhe Jiang, Yujuan Tan, Duo Liu
- Abstract summary: Large language models (LLMs) face significant deployment challenges due to their massive computational demands. This paper proposes a novel pruning method, $D^2Prune$, to address these limitations. $D^2Prune$ consistently outperforms SOTA methods across various LLMs.
- Score: 13.59262810896553
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) face significant deployment challenges due to their massive computational demands. While pruning offers a promising compression solution, existing methods suffer from two critical limitations: (1) they neglect activation distribution shifts between calibration data and test data, resulting in inaccurate error estimations; (2) they overlook the long-tail distribution characteristics of activations in the attention module. To address these limitations, this paper proposes a novel pruning method, $D^2Prune$. First, we propose a dual Taylor expansion-based method that jointly models weight and activation perturbations for precise error estimation, leading to more accurate pruning mask selection and weight updates and facilitating error minimization during pruning. Second, we propose an attention-aware dynamic update strategy that preserves the long-tail attention pattern by jointly minimizing the KL divergence of attention distributions and the reconstruction error. Extensive experiments show that $D^2Prune$ consistently outperforms SOTA methods across various LLMs (e.g., OPT-125M, LLaMA2/3, and Qwen3). Moreover, the dynamic attention update mechanism also generalizes well to ViT-based vision models like DeiT, achieving superior accuracy on ImageNet-1K.
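The attention-aware objective described in the abstract can be sketched as a joint loss: the KL divergence between the dense model's attention distribution and the pruned model's, plus the layer reconstruction error. This is a minimal illustration of that idea, not the paper's exact formulation; the function and variable names (and the weighting `lam`) are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_aware_loss(scores_dense, scores_pruned, out_dense, out_pruned, lam=1.0):
    """KL(dense attention || pruned attention) + lam * reconstruction error.

    A sketch of the joint objective the abstract describes; the exact
    weighting and normalization used by D^2Prune may differ.
    """
    p = softmax(scores_dense)   # attention distribution before pruning
    q = softmax(scores_pruned)  # attention distribution after pruning
    eps = 1e-12                 # guard against log(0)
    kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1).mean()
    recon = np.mean((out_dense - out_pruned) ** 2)
    return kl + lam * recon
```

The loss is zero when the pruned layer reproduces both the attention distribution and the output exactly, and grows with either deviation.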
Related papers
- Mixture of Distributions Matters: Dynamic Sparse Attention for Efficient Video Diffusion Transformers [13.366686736005699]
We present MOD-DiT, a sampling-free dynamic attention framework. It accurately models evolving attention patterns through a two-stage process. It overcomes the computational limitations of traditional sparse attention approaches.
arXiv Detail & Related papers (2026-01-14T16:25:39Z) - Which Layer Causes Distribution Deviation? Entropy-Guided Adaptive Pruning for Diffusion and Flow Models [77.55829017952728]
EntPruner is an entropy-guided automatic progressive pruning framework for diffusion and flow models. Experiments on DiT and SiT models demonstrate the effectiveness of EntPruner, achieving up to 2.22$\times$ inference speedup.
arXiv Detail & Related papers (2025-11-26T07:20:48Z) - D2D: Detector-to-Differentiable Critic for Improved Numeracy in Text-to-Image Generation [26.820694706602236]
Detector-to-Differentiable (D2D) is a novel framework that transforms non-differentiable detection models into differentiable critics. Our experiments on SDXL-Turbo, SD-Turbo, and Pixart-DMD demonstrate consistent and substantial improvements in object counting accuracy.
arXiv Detail & Related papers (2025-10-22T06:27:05Z) - Accelerated Aggregated D-Optimal Designs for Estimating Main Effects in Black-Box Models [3.093890460224435]
We propose A2D2E, an $\textbf{E}$stimator based on $\textbf{A}$ccelerated $\textbf{A}$ggregated $\textbf{D}$esigns. We establish theoretical guarantees, including convergence and variance reduction, and validate A2D2E through extensive simulations.
arXiv Detail & Related papers (2025-10-09T17:07:36Z) - Dual-Stage Reweighted MoE for Long-Tailed Egocentric Mistake Detection [85.0189917888094]
We propose a Dual-Stage Reweighted Mixture-of-Experts (DR-MoE) framework to handle the challenges posed by subtle and infrequent mistakes. The proposed method achieves strong performance, particularly in identifying rare and ambiguous mistake instances.
arXiv Detail & Related papers (2025-09-16T12:00:42Z) - Attention, Please! Revisiting Attentive Probing Through the Lens of Efficiency [20.320991233039965]
As fine-tuning becomes impractical at scale, probing is emerging as the preferred evaluation protocol. This motivates the need for attentive probing, an alternative that uses attention to selectively aggregate patch-level features. We present the first comprehensive study of existing methods, analyzing their design choices and benchmarking their performance.
arXiv Detail & Related papers (2025-06-11T21:10:26Z) - SPAT: Sensitivity-based Multihead-attention Pruning on Time Series Forecasting Models [8.817690876855728]
We propose a structured pruning method, SPAT ($\textbf{S}$ensitivity $\textbf{P}$runer for $\textbf{At}$tention), which selectively removes attention mechanisms and yields highly effective models. Experiments on datasets demonstrate that SPAT-pruned models achieve reductions of 2.842% in MSE, 1.996% in MAE, and 35.274% in FLOPs.
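Sensitivity-based pruning of the kind SPAT's title suggests can be illustrated generically: score each attention head by how much the loss increases when that head is masked out, then prune the least sensitive heads first. This is a standard sensitivity proxy, not SPAT's exact multihead criterion; the function names and the `loss_fn(mask)` interface are illustrative assumptions.

```python
import numpy as np

def rank_heads_by_sensitivity(loss_fn, num_heads):
    """Rank attention heads by the loss increase caused by masking each one.

    `loss_fn(mask)` evaluates the model under a 0/1 mask over heads; heads
    whose removal barely moves the loss come first (safest to prune).
    """
    full = np.ones(num_heads)
    base = loss_fn(full)  # loss with all heads active
    sensitivities = []
    for h in range(num_heads):
        mask = full.copy()
        mask[h] = 0.0  # ablate head h
        sensitivities.append(loss_fn(mask) - base)
    return np.argsort(sensitivities)  # ascending: least sensitive first
```

With a toy loss that weights heads unevenly, the ranking recovers the least important heads first.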
arXiv Detail & Related papers (2025-05-13T17:39:31Z) - Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free [81.65559031466452]
We conduct experiments to investigate gating-augmented softmax attention variants. We find that a simple modification-applying a head-specific sigmoid gate after the Scaled Dot-Product Attention (SDPA)-consistently improves performance.
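The modification described above is concrete enough to sketch: run standard SDPA, then scale its output elementwise by a sigmoid of a learned per-head gate. This is a minimal NumPy illustration under assumed shapes; `gate_logit` is a hypothetical name for the learned per-head parameter.

```python
import numpy as np

def gated_sdpa(q, k, v, gate_logit):
    """Scaled dot-product attention followed by a sigmoid gate on the output.

    q, k, v: arrays of shape (seq_len, d_head) for one head;
    gate_logit: the head's learned scalar gate parameter (assumed name).
    """
    d = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d)  # scaled dot products
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)      # softmax over keys
    out = attn @ v                                # standard SDPA output
    gate = 1.0 / (1.0 + np.exp(-gate_logit))      # sigmoid gate in (0, 1)
    return gate * out
```

A gate logit of 0 halves the output (sigmoid(0) = 0.5); a large positive logit leaves SDPA essentially unchanged, so the gate can learn to suppress or pass each head.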
arXiv Detail & Related papers (2025-05-10T17:15:49Z) - Predictor-Corrector Enhanced Transformers with Exponential Moving Average Coefficient Learning [73.73967342609603]
We introduce a predictor-corrector learning framework to minimize truncation errors.
We also propose an exponential moving average-based coefficient learning method to strengthen our higher-order predictor.
Our model surpasses a robust 3.8B DeepNet by an average of 2.9 SacreBLEU, using only 1/3 parameters.
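The exponential moving average-based coefficient learning mentioned above can be sketched generically: successive coefficient estimates are blended with decay `beta`, so later estimates enter smoothly instead of replacing earlier ones. This is a plain EMA sketch, not the paper's exact parameterization of the higher-order predictor.

```python
def ema_coefficients(estimates, beta=0.9):
    """Exponential moving average over a sequence of coefficient estimates.

    Each new estimate contributes with weight (1 - beta); older estimates
    decay geometrically. Names and the interface are illustrative.
    """
    ema = estimates[0]
    for e in estimates[1:]:
        ema = beta * ema + (1.0 - beta) * e
    return ema
```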
arXiv Detail & Related papers (2024-11-05T12:26:25Z) - Model Inversion Attacks Through Target-Specific Conditional Diffusion Models [54.69008212790426]
Model inversion attacks (MIAs) aim to reconstruct private images from a target classifier's training set, thereby raising privacy concerns in AI applications.
Previous GAN-based MIAs tend to suffer from inferior generative fidelity due to GAN's inherent flaws and biased optimization within latent space.
We propose Diffusion-based Model Inversion (Diff-MI) attacks to alleviate these issues.
arXiv Detail & Related papers (2024-07-16T06:38:49Z) - Transformers meet Stochastic Block Models: Attention with Data-Adaptive Sparsity and Cost [53.746169882193456]
Recent works have proposed various sparse attention modules to overcome the quadratic cost of self-attention.
We propose a model that resolves both problems by endowing each attention head with a mixed-membership Block Model.
Our model outperforms previous efficient variants as well as the original Transformer with full attention.
arXiv Detail & Related papers (2022-10-27T15:30:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.