A Refined Analysis of Massive Activations in LLMs
- URL: http://arxiv.org/abs/2503.22329v1
- Date: Fri, 28 Mar 2025 11:08:34 GMT
- Title: A Refined Analysis of Massive Activations in LLMs
- Authors: Louis Owen, Nilabhra Roy Chowdhury, Abhay Kumar, Fabian Güra
- Abstract summary: We conduct an analysis of massive activations across a broad range of large language models (LLMs). Our findings challenge several prior assumptions, most importantly: (1) not all massive activations are detrimental, i.e., suppressing them does not lead to an explosion of perplexity or a collapse in downstream task performance; and (2) proposed mitigation strategies such as Attention KV bias are model-specific and ineffective in certain cases.
- Score: 0.3574867616159909
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Motivated in part by their relevance for low-precision training and quantization, massive activations in large language models (LLMs) have recently emerged as a topic of interest. However, existing analyses are limited in scope, and generalizability across architectures is unclear. This paper helps address some of these gaps by conducting an analysis of massive activations across a broad range of LLMs, including both GLU-based and non-GLU-based architectures. Our findings challenge several prior assumptions, most importantly: (1) not all massive activations are detrimental, i.e., suppressing them does not lead to an explosion of perplexity or a collapse in downstream task performance; and (2) proposed mitigation strategies such as Attention KV bias are model-specific and ineffective in certain cases. We consequently investigate novel hybrid mitigation strategies; in particular, pairing Target Variance Rescaling (TVR) with Attention KV bias or Dynamic Tanh (DyT) successfully balances the mitigation of massive activations with preserved downstream model performance in the scenarios we investigated. Our code is available at: https://github.com/bluorion-com/refine_massive_activations.
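For concreteness, below is a minimal PyTorch sketch of two ingredients mentioned in the abstract: detecting and suppressing massive activations (the probe behind claim (1)), and a Dynamic Tanh layer. The magnitude thresholds follow the working definition used in prior work on massive activations, and `alpha_init` is a commonly reported default; neither is taken from this paper's exact configuration, and the function names are illustrative, not from the released code.

```python
import torch
import torch.nn as nn

def find_massive_activations(hidden: torch.Tensor,
                             abs_threshold: float = 100.0,
                             ratio_threshold: float = 1000.0) -> torch.Tensor:
    # A value is flagged as "massive" if its magnitude exceeds abs_threshold
    # and is at least ratio_threshold times the median magnitude of the
    # tensor -- the working definition from prior work on the topic.
    # `hidden` is a (seq_len, d_model) hidden-state matrix from one layer.
    mags = hidden.abs()
    mask = (mags > abs_threshold) & (mags > ratio_threshold * mags.median())
    return mask.nonzero()  # rows of (token_index, channel_index)

def suppress_massive_activations(hidden: torch.Tensor) -> torch.Tensor:
    # The suppression probe: zero the flagged entries and re-evaluate the
    # model. Whether this degrades perplexity is what the study measures.
    out = hidden.clone()
    idx = find_massive_activations(out)
    out[idx[:, 0], idx[:, 1]] = 0.0
    return out

class DyT(nn.Module):
    # Dynamic Tanh: an elementwise tanh(alpha * x) with a learnable affine,
    # proposed elsewhere as a drop-in replacement for normalization layers;
    # the paper pairs it (or Attention KV bias) with TVR as a hybrid
    # mitigation strategy.
    def __init__(self, d_model: int, alpha_init: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1,), alpha_init))
        self.gamma = nn.Parameter(torch.ones(d_model))
        self.beta = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gamma * torch.tanh(self.alpha * x) + self.beta
```

In an actual experiment, the suppression step would be applied via a forward hook on a specific layer's hidden states, with perplexity and downstream accuracy compared before and after.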
Related papers
- R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference [77.47238561728459]
R-Sparse is a training-free activation sparsity approach capable of achieving high sparsity levels in advanced LLMs.
Experiments on Llama-2/3 and Mistral models across ten diverse tasks demonstrate that R-Sparse achieves comparable performance at 50% model-level sparsity.
arXiv Detail & Related papers (2025-04-28T03:30:32Z)
- Innate Reasoning is Not Enough: In-Context Learning Enhances Reasoning Large Language Models with Less Overthinking [39.48406368755411]
Recent advances in Large Language Models (LLMs) have introduced Reasoning Large Language Models (RLLMs).
RLLMs exhibit innate Chain-of-Thought (CoT) reasoning capability obtained from training, leading to a natural question: "Is CoT prompting necessary to enhance the reasoning capability of RLLMs?"
We present the first comprehensive analysis of the impacts of Zero-shot CoT and Few-shot CoT on RLLMs across mathematical reasoning tasks.
arXiv Detail & Related papers (2025-03-25T12:37:22Z)
- U-NIAH: Unified RAG and LLM Evaluation for Long Context Needle-In-A-Haystack [9.760456105567078]
This paper introduces U-NIAH, a unified framework that systematically compares Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG).
Our framework incorporates multi-needle, long-needle, and needle-in-needle configurations, along with different retrieval settings.
Our findings show that RAG significantly enhances smaller LLMs by mitigating the "lost-in-the-middle" effect and improving robustness.
arXiv Detail & Related papers (2025-03-01T05:05:24Z)
- Sparsing Law: Towards Large Language Models with Greater Activation Sparsity [62.09617609556697]
Activation sparsity denotes the existence of substantial weakly-contributed elements within activation outputs that can be eliminated.
We propose PPL-p% sparsity, a precise and performance-aware activation sparsity metric.
We show that ReLU is more efficient as the activation function than SiLU and can leverage more training data to improve activation sparsity.
arXiv Detail & Related papers (2024-11-04T17:59:04Z)
- Not All Diffusion Model Activations Have Been Evaluated as Discriminative Features [115.33889811527533]
Diffusion models were initially designed for image generation.
Recent research shows that the internal signals within their backbones, named activations, can also serve as dense features for various discriminative tasks.
arXiv Detail & Related papers (2024-10-04T16:05:14Z)
- Defending Large Language Models Against Attacks With Residual Stream Activation Analysis [0.0]
Large Language Models (LLMs) are vulnerable to adversarial threats.
This paper presents an innovative defensive strategy, given white box access to an LLM.
We apply a novel methodology for analyzing distinctive activation patterns in the residual streams for attack prompt classification.
arXiv Detail & Related papers (2024-06-05T13:06:33Z)
- Mitigating Quantization Errors Due to Activation Spikes in GLU-Based LLMs [5.408684636210501]
Post-training quantization (PTQ) has become a popular approach, quantizing weights and activations to lower precision.
We show the challenges of activation quantization in GLU variants, which are widely used in the feed-forward networks (FFNs) of modern large language models.
We propose two empirical methods, Quantization-free Module (QFeM) and Quantization-free Prefix (QFeP), to isolate the activation spikes during quantization.
arXiv Detail & Related papers (2024-05-23T10:54:14Z)
- Dynamic Activation Pitfalls in LLaMA Models: An Empirical Study [20.404448253054014]
We investigate the efficacy of dynamic activation mechanisms within the LLaMA family of language models.
Our empirical findings have uncovered several inherent pitfalls in the current dynamic activation schemes.
arXiv Detail & Related papers (2024-05-15T11:42:42Z)
- Endogenous Macrodynamics in Algorithmic Recourse [52.87956177581998]
Existing work on Counterfactual Explanations (CE) and Algorithmic Recourse (AR) has largely focused on single individuals in a static environment.
We show that many of the existing methodologies can be collectively described by a generalized framework.
We then argue that the existing framework does not account for a hidden external cost of recourse, which only reveals itself when studying the endogenous dynamics of recourse at the group level.
arXiv Detail & Related papers (2023-08-16T07:36:58Z)
- Do Emergent Abilities Exist in Quantized Large Language Models: An Empirical Study [90.34226812493083]
This work aims to investigate the impact of quantization on emergent abilities, which are important characteristics that distinguish LLMs from small language models.
Our empirical experiments show that these emergent abilities still exist in 4-bit quantization models, while 2-bit models encounter severe performance degradation.
To improve the performance of low-bit models, we conduct two special experiments: (1) fine-grained impact analysis that studies which components (or substructures) are more sensitive to quantization, and (2) performance compensation through model fine-tuning.
arXiv Detail & Related papers (2023-07-16T15:11:01Z)
- Clustering Effect of (Linearized) Adversarial Robust Models [60.25668525218051]
We propose a novel understanding of adversarial robustness and apply it to more tasks, including domain adaptation and robustness boosting.
Experimental evaluations demonstrate the rationality and superiority of our proposed clustering strategy.
arXiv Detail & Related papers (2021-11-25T05:51:03Z)