Universal Properties of Activation Sparsity in Modern Large Language Models
- URL: http://arxiv.org/abs/2509.00454v1
- Date: Sat, 30 Aug 2025 10:47:21 GMT
- Title: Universal Properties of Activation Sparsity in Modern Large Language Models
- Authors: Filip Szatkowski, Patryk Będkowski, Alessio Devoto, Jan Dubiński, Pasquale Minervini, Mikołaj Piórczyński, Simone Scardapane, Bartosz Wójcik
- Abstract summary: We propose a framework to assess sparsity robustness and present a systematic study of the phenomenon in the FFN layers of modern LLMs. Our findings reveal universal patterns of activation sparsity in LLMs, provide insights into this phenomenon, and offer practical guidelines for exploiting it in model design and acceleration.
- Score: 20.84931970096774
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Input-dependent activation sparsity is a notable property of deep learning models, which has been extensively studied in networks with ReLU activations and is associated with efficiency, robustness, and interpretability. However, the approaches developed for ReLU-based models depend on exact zero activations and do not transfer directly to modern large language models (LLMs), which have abandoned ReLU in favor of other activation functions. As a result, current work on activation sparsity in LLMs is fragmented, model-specific, and lacks consensus on which components to target. We propose a general framework to assess sparsity robustness and present a systematic study of the phenomenon in the FFN layers of modern LLMs, including diffusion LLMs. Our findings reveal universal patterns of activation sparsity in LLMs, provide insights into this phenomenon, and offer practical guidelines for exploiting it in model design and acceleration.
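As a rough illustration of the setting the abstract describes, the sketch below (not from the paper; the weights, dimensions, and threshold value are illustrative assumptions) measures input-dependent activation sparsity in a LLaMA-style gated FFN, where exact zeros are rare and a magnitude threshold stands in for them:

```python
import numpy as np

def silu(x):
    # SiLU (swish): x * sigmoid(x); outputs are rarely exactly zero
    return x * (1.0 / (1.0 + np.exp(-x)))

def ffn_hidden(x, W_gate, W_up):
    # Gated FFN hidden state as in LLaMA-style blocks: silu(x W_gate) * (x W_up)
    return silu(x @ W_gate) * (x @ W_up)

def activation_sparsity(h, tau=1e-2):
    # Fraction of hidden units whose magnitude falls below threshold tau;
    # without ReLU there are no exact zeros, so a magnitude cutoff is used
    return float(np.mean(np.abs(h) < tau))

rng = np.random.default_rng(0)
d, d_ff = 16, 64
x = rng.standard_normal((8, d))          # a batch of token representations
W_gate = rng.standard_normal((d, d_ff)) * 0.1
W_up = rng.standard_normal((d, d_ff)) * 0.1
h = ffn_hidden(x, W_gate, W_up)
print(activation_sparsity(h, tau=1e-2))  # varies with the input batch
```

The choice of tau is the crux: the paper's framework is precisely about how robust model quality is to such thresholds, which this toy sketch does not evaluate.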
Related papers
- Meaningless Tokens, Meaningful Gains: How Activation Shifts Enhance LLM Reasoning [53.35553353785948]
Motivated by the puzzling observation that inserting long sequences of meaningless tokens before the query prompt can consistently enhance reasoning LLM performance, this work analyzes the underlying mechanism driving this phenomenon. We find that the improvements arise from a redistribution of activations in the LLM's layers, where near zero activations become less frequent while large magnitude activations increase. We propose a lightweight inference-time technique that modifies activations directly without altering the input sequence.
arXiv Detail & Related papers (2025-10-01T15:39:38Z) - LLM Unlearning via Neural Activation Redirection [24.157334866277534]
We propose LUNAR, a novel unlearning method grounded in the Linear Representation Hypothesis. We show that LUNAR achieves state-of-the-art unlearning performance and superior controllability.
arXiv Detail & Related papers (2025-02-11T03:23:22Z) - Sparsing Law: Towards Large Language Models with Greater Activation Sparsity [64.15238674475619]
Activation sparsity denotes the existence of substantial weakly-contributed elements within activation outputs that can be eliminated. We propose PPL-$p\%$ sparsity, a precise and performance-aware activation sparsity metric. We show that ReLU is more efficient as the activation function than SiLU and can leverage more training data to improve activation sparsity.
arXiv Detail & Related papers (2024-11-04T17:59:04Z) - MOYU: A Theoretical Study on Massive Over-activation Yielded Uplifts in LLMs [20.404448253054014]
Massive Over-activation Yielded Uplifts (MOYU) is an inherent property of large language models, and dynamic activation built upon MOYU is a clever yet under-explored strategy designed to accelerate inference in these models.
arXiv Detail & Related papers (2024-06-18T12:57:33Z) - Dynamic Activation Pitfalls in LLaMA Models: An Empirical Study [20.404448253054014]
We investigate the efficacy of dynamic activation mechanisms within the LLaMA family of language models.
Our empirical findings have uncovered several inherent pitfalls in the current dynamic activation schemes.
arXiv Detail & Related papers (2024-05-15T11:42:42Z) - Characterizing Truthfulness in Large Language Model Generations with Local Intrinsic Dimension [63.330262740414646]
We study how to characterize and predict the truthfulness of texts generated from large language models (LLMs).
We suggest investigating internal activations and quantifying LLM's truthfulness using the local intrinsic dimension (LID) of model activations.
arXiv Detail & Related papers (2024-02-28T04:56:21Z) - ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models [74.59731375779934]
Activation sparsity refers to the existence of weakly-contributed elements among activation outputs. This paper introduces a simple and effective sparsification method named "ProSparse" to push LLMs for higher activation sparsity.
arXiv Detail & Related papers (2024-02-21T03:58:49Z) - ReLU$^2$ Wins: Discovering Efficient Activation Functions for Sparse
LLMs [91.31204876440765]
We introduce a general method that defines neuron activation through neuron output magnitudes and a tailored magnitude threshold.
To find the most efficient activation function for sparse computation, we propose a systematic framework.
We conduct thorough experiments on LLMs utilizing different activation functions, including ReLU, SwiGLU, ReGLU, and ReLU$^2$.
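The magnitude-threshold notion of neuron activation described above can be sketched as follows (a hypothetical toy version, not the paper's implementation; the threshold value and the input distribution are assumptions):

```python
import numpy as np

# A neuron counts as "active" when its output magnitude exceeds a threshold,
# which generalizes exact-zero sparsity to arbitrary activation functions.
def relu(x):
    return np.maximum(x, 0.0)

def relu2(x):
    return np.maximum(x, 0.0) ** 2  # ReLU^2

def silu(x):
    return x / (1.0 + np.exp(-x))

def active_fraction(fn, x, tau=1e-3):
    # Fraction of neurons whose output magnitude exceeds tau
    return float(np.mean(np.abs(fn(x)) > tau))

rng = np.random.default_rng(0)
x = rng.standard_normal(10000)  # stand-in pre-activation values
for name, fn in [("ReLU", relu), ("ReLU^2", relu2), ("SiLU", silu)]:
    print(name, active_fraction(fn, x))
```

On symmetric inputs, ReLU-family functions leave roughly half the neurons inactive, while SiLU's small-but-nonzero tail keeps most neurons above any tiny threshold, which is why a tailored threshold matters for sparse computation.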
arXiv Detail & Related papers (2024-02-06T08:45:51Z) - ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models [35.77063662562747]
Large Language Models (LLMs) with billions of parameters have drastically transformed AI applications.
Their demanding computation during inference has raised significant challenges for deployment on resource-constrained devices.
We demonstrate that using the ReLU activation function has a negligible impact on convergence and performance while significantly reducing computation and weight transfer.
arXiv Detail & Related papers (2023-10-06T20:01:33Z) - Large Language Models Are Latent Variable Models: Explaining and Finding Good Demonstrations for In-Context Learning [104.58874584354787]
In recent years, pre-trained large language models (LLMs) have demonstrated remarkable efficiency in achieving an inference-time few-shot learning capability known as in-context learning.
This study aims to examine the in-context learning phenomenon through a Bayesian lens, viewing real-world LLMs as latent variable models.
arXiv Detail & Related papers (2023-01-27T18:59:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.