WiSparse: Boosting LLM Inference Efficiency with Weight-Aware Mixed Activation Sparsity
- URL: http://arxiv.org/abs/2602.14452v1
- Date: Mon, 16 Feb 2026 04:18:36 GMT
- Title: WiSparse: Boosting LLM Inference Efficiency with Weight-Aware Mixed Activation Sparsity
- Authors: Lei Chen, Yuan Meng, Xiaoyu Zhan, Zhi Wang, Wenwu Zhu
- Abstract summary: Training-free activation sparsity is a promising approach for efficient Large Language Model (LLM) inference. We propose Weight-aware Mixed-Granularity Training-free Activation Sparsity (WiSparse). We show that WiSparse preserves 97% of Llama3.1's dense performance, surpassing the strongest baseline by 2.23 percentage points.
- Score: 32.61817486761883
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) offer strong capabilities but incur high inference costs due to dense computation and memory access. Training-free activation sparsity is a promising approach for efficient LLM inference, yet existing methods often rely solely on activation information and uniform sparsity ratios. This overlooks the critical interplay with weights and inter-block sensitivity variation, leading to suboptimal performance. We identify two key phenomena in modern LLMs: 1) less significant activations may align with highly important weights, and 2) sparsity sensitivity varies non-monotonically across model blocks. We propose Weight-aware Mixed-Granularity Training-free Activation Sparsity (WiSparse), which leverages both activation and weight information for adaptive sparsity allocation. Specifically, we introduce a weight-aware mechanism integrating activation magnitudes with precomputed weight norms to accurately identify salient channels. This is combined with a mixed-granularity allocation scheme: a global budget is distributed across blocks via evolutionary search to protect sensitive regions, then refined within blocks to minimize reconstruction error. We improve sparse kernels and demonstrate effectiveness on three representative models. Notably, at 50% sparsity, WiSparse preserves 97% of Llama3.1's dense performance, surpassing the strongest baseline by 2.23 percentage points while achieving a 21.4% acceleration in end-to-end inference speed. Our research advances the limits of training-free approaches for efficient LLM inference, pushing the boundaries of achievable speedup without training.
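As a rough illustration of the weight-aware mechanism described in the abstract, the sketch below scores each input channel by combining its activation magnitude with the precomputed column norm of the layer's weight matrix, then keeps only the top-scoring channels under a sparsity budget. This is a minimal sketch based solely on the abstract: the function name, tensor shapes, and the exact saliency formula (|x_j| · ||W[:, j]||_2) are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn.functional as F

def weight_aware_channel_mask(x: torch.Tensor, weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Hypothetical sketch of weight-aware channel selection (not the official WiSparse code).

    Scores each input channel as |x_j| * ||W[:, j]||_2, combining activation
    magnitude with the precomputed weight column norm, then keeps only the
    highest-scoring channels under the given sparsity budget.
    """
    col_norms = weight.norm(dim=0)                    # [in_features]; precomputed once per layer in practice
    scores = x.abs() * col_norms                      # weight-aware saliency per channel
    k = max(1, int(x.shape[-1] * (1.0 - sparsity)))   # number of channels to keep
    topk = scores.topk(k, dim=-1).indices
    mask = torch.zeros_like(x, dtype=torch.bool)
    mask.scatter_(-1, topk, True)
    return mask

# Toy usage: 50% channel sparsity on the input of one linear projection.
x = torch.randn(1, 4096)                  # activations for a single token
W = torch.randn(11008, 4096)              # [out_features, in_features] projection weight
mask = weight_aware_channel_mask(x, W, sparsity=0.5)
y = F.linear(x * mask, W)                 # masked channels contribute nothing to the output
```

In the full method, the per-block sparsity budget would not be a single global constant as in this sketch: according to the abstract, the global budget is distributed across blocks via evolutionary search and then refined within blocks to minimize reconstruction error.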
Related papers
- Nemotron-Flash: Towards Latency-Optimal Hybrid Small Language Models [97.55009021098554]
This work aims to identify the key determinants of SLMs' real-device latency and offer generalizable principles and methodologies for SLM design and training. We introduce a new family of hybrid SLMs, called Nemotron-Flash, which significantly advances the accuracy-efficiency frontier of state-of-the-art SLMs.
arXiv Detail & Related papers (2025-11-24T08:46:36Z) - Shuffle-R1: Efficient RL framework for Multimodal Large Language Models via Data-centric Dynamic Shuffle [65.14124923451077]
Reinforcement learning (RL) has emerged as an effective post-training paradigm for enhancing the reasoning capabilities of multimodal large language models (MLLMs). However, current RL pipelines often suffer from training inefficiencies caused by two underexplored issues: Advantage Collapsing and Rollout Silencing. We propose Shuffle-R1, a simple yet principled framework that improves RL fine-tuning efficiency by dynamically restructuring trajectory sampling and batch composition.
arXiv Detail & Related papers (2025-08-07T17:53:47Z) - WINA: Weight Informed Neuron Activation for Accelerating Large Language Model Inference [44.538579135121466]
WINA (Weight Informed Neuron Activation) is a novel, simple, and training-free sparse activation framework. We show that WINA obtains optimal approximation error bounds with theoretical guarantees tighter than existing techniques. It also outperforms state-of-the-art methods (e.g., TEAL) by up to $2.94\%$ in average performance at the same sparsity levels.
arXiv Detail & Related papers (2025-05-26T02:37:32Z) - Polar Sparsity: High Throughput Batched LLM Inferencing with Scalable Contextual Sparsity [4.24164487223914]
We introduce Polar Sparsity, highlighting a key shift in sparsity importance from dense to Attention layers as we scale batch size and sequence length. We develop hardware-efficient, sparsity-aware kernels for selective computation and Attention, delivering up to $2.2\times$ end-to-end speedup for models like OPT and LLaMA-2 & 3, across various batch sizes and sequence lengths without compromising accuracy.
arXiv Detail & Related papers (2025-05-20T20:15:42Z) - R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference [77.47238561728459]
R-Sparse is a training-free activation sparsity approach capable of achieving high sparsity levels in advanced LLMs. Experiments on Llama-2/3 and Mistral models across ten diverse tasks demonstrate that R-Sparse achieves comparable performance at 50% model-level sparsity.
arXiv Detail & Related papers (2025-04-28T03:30:32Z) - Activation Sparsity Opportunities for Compressing General Large Language Models [4.5624217435826]
This work systematically investigates the tradeoff between enforcing activation sparsity and perplexity (accuracy) on state-of-the-art AI models. Our empirical analysis demonstrates that we can obtain around 50% main memory and computing reductions for critical FFN components with negligible accuracy degradation.
arXiv Detail & Related papers (2024-12-13T02:26:54Z) - Sparsing Law: Towards Large Language Models with Greater Activation Sparsity [64.15238674475619]
Activation sparsity denotes the existence of substantial weakly-contributed elements within activation outputs that can be eliminated. We propose PPL-$p\%$ sparsity, a precise and performance-aware activation sparsity metric. We show that ReLU is more efficient as the activation function than SiLU and can leverage more training data to improve activation sparsity.
arXiv Detail & Related papers (2024-11-04T17:59:04Z) - Q-Sparse: All Large Language Models can be Fully Sparsely-Activated [93.45300714803429]
We introduce Q-Sparse, a simple yet effective approach to training sparsely-activated large language models (LLMs).
Q-Sparse enables full sparsity of activations in LLMs which can bring significant efficiency gains in inference.
We also introduce Block Q-Sparse for batch training and inference.
arXiv Detail & Related papers (2024-07-15T17:59:29Z) - One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models [42.95555008229016]
We propose a method based on Hessian sensitivity-aware mixed sparsity pruning to prune LLMs to at least 50% sparsity without the need for any retraining.
The advantages of the proposed method become even more pronounced when the sparsity is extremely high.
arXiv Detail & Related papers (2023-10-14T05:43:09Z)