Massive Activations in Large Language Models
- URL: http://arxiv.org/abs/2402.17762v2
- Date: Wed, 14 Aug 2024 16:00:49 GMT
- Title: Massive Activations in Large Language Models
- Authors: Mingjie Sun, Xinlei Chen, J. Zico Kolter, Zhuang Liu
- Abstract summary: We show the widespread existence of massive activations across various Large Language Models (LLMs).
Massive activations lead to the concentration of attention probabilities on their corresponding tokens and give rise to implicit bias terms in the self-attention output.
- Score: 77.51561903918535
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We observe an empirical phenomenon in Large Language Models (LLMs) -- very few activations exhibit significantly larger values than others (e.g., 100,000 times larger). We call them massive activations. First, we demonstrate the widespread existence of massive activations across various LLMs and characterize their locations. Second, we find their values largely stay constant regardless of the input, and they function as indispensable bias terms in LLMs. Third, these massive activations lead to the concentration of attention probabilities to their corresponding tokens, and further, implicit bias terms in the self-attention output. Last, we also study massive activations in Vision Transformers. Code is available at https://github.com/locuslab/massive-activations.
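To make the phenomenon concrete, below is a minimal sketch (not the authors' released code) that scans the hidden states of a Hugging Face causal LM for massive activations. The model name is only an example, and the detection rule (absolute value above 100 and roughly 1,000 times the median magnitude of the layer) follows the paper's informal criterion; both are assumptions to adjust per model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal sketch: scan the hidden states (residual stream) of a causal LM for
# "massive activations". Model choice and thresholds are illustrative.
model_name = "meta-llama/Llama-2-7b-hf"  # any decoder-only LLM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

inputs = tok("Summer is warm. Winter is cold.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

for layer_idx, h in enumerate(out.hidden_states):   # each: (1, seq_len, hidden_dim)
    mags = h[0].abs().float()                        # (seq_len, hidden_dim)
    median = mags.median()
    # Informal criterion: magnitude above 100 and ~1000x the layer's median magnitude.
    hits = ((mags > 100) & (mags > 1000 * median)).nonzero()
    for tok_idx, dim in hits.tolist():
        token = tok.decode([inputs["input_ids"][0, tok_idx].item()])
        print(f"layer {layer_idx}: token {tok_idx} ({token!r}), "
              f"dim {dim}, value {h[0, tok_idx, dim].item():.1f}")
```

The paper reports that such hits cluster on a small, fixed set of feature dimensions and on a few special positions such as the starting token and delimiter tokens, largely independent of the input text.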
Related papers
- Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding [58.364933651703524]
We show that concentrated massive values consistently emerge in specific regions of attention queries.
These massive values play a critical role in interpreting contextual knowledge.
We trace the emergence of massive values and find that such concentration is caused by Rotary Positional Embeddings (RoPE).
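As a reference point for the RoPE connection drawn above, the following self-contained sketch applies a standard rotary embedding (rotate-half convention) to toy query vectors and inspects which dimensions carry the largest average magnitude. The random queries are a stand-in for illustration; this does not reproduce the paper's analysis.

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embedding to x of shape (seq_len, head_dim).

    Standard RoPE: dimension pairs at position p are rotated by
    angle p * base**(-2i / head_dim) for pair index i.
    """
    seq_len, head_dim = x.shape
    half = head_dim // 2
    inv_freq = base ** (-torch.arange(half).float() / half)
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)  # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]                              # rotate-half convention
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Toy check: after RoPE, look at which query dimensions have the largest mean magnitude.
q = torch.randn(16, 64)                     # (seq_len, head_dim), random stand-in queries
per_dim = rope(q).abs().mean(dim=0)         # mean |value| per dimension
print("top-5 dimensions by mean magnitude:", per_dim.topk(5).indices.tolist())
```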
arXiv Detail & Related papers (2025-02-03T17:47:03Z)
- Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings [69.35226485836641]
The visual tokens used by existing Multimodal Large Language Models (MLLMs) are often highly redundant and incur prohibitive computational cost.
We propose a simple yet effective method to improve the efficiency of MLLMs, termed dynamic visual-token exit (DyVTE).
DyVTE uses lightweight hyper-networks to perceive the text token status and decide the removal of all visual tokens after a certain layer.
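For intuition only, here is a toy gate in the spirit of DyVTE, a sketch under assumptions rather than the paper's implementation: a small MLP (standing in for the "lightweight hyper-network") pools the text-token hidden states at a given layer and predicts whether all visual tokens can be dropped from that layer onward. Class and variable names are hypothetical.

```python
import torch
import torch.nn as nn

class VisualTokenExitGate(nn.Module):
    """Toy exit gate (illustrative, not DyVTE's actual architecture): pool the
    text-token states at one layer and predict whether visual tokens can be removed."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 4),
            nn.GELU(),
            nn.Linear(hidden_dim // 4, 1),
        )

    def forward(self, hidden: torch.Tensor, text_mask: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, hidden_dim); text_mask: (batch, seq_len) bool
        m = text_mask.unsqueeze(-1).float()
        pooled = (hidden * m).sum(1) / m.sum(1).clamp(min=1.0)   # mean over text tokens
        return torch.sigmoid(self.mlp(pooled)).squeeze(-1)       # exit probability per sample

# Usage sketch with random states (untrained gate, so probabilities are near 0.5).
batch, seq, dim = 2, 10, 512
hidden = torch.randn(batch, seq, dim)
is_text = torch.zeros(batch, seq, dtype=torch.bool)
is_text[:, 6:] = True                       # pretend the last 4 tokens are text
gate = VisualTokenExitGate(dim)
p_exit = gate(hidden, is_text)
print(p_exit, (p_exit >= 0.5))              # True -> drop visual tokens after this layer
```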
arXiv Detail & Related papers (2024-11-29T11:24:23Z)
- Sparsing Law: Towards Large Language Models with Greater Activation Sparsity [62.09617609556697]
Activation sparsity denotes the existence of substantial weakly-contributed elements within activation outputs that can be eliminated.
We propose PPL-$p\%$ sparsity, a precise and performance-aware activation sparsity metric.
We show that ReLU is more efficient as the activation function than SiLU and can leverage more training data to improve activation sparsity.
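The PPL-$p\%$ metric ties sparsity to a tolerated perplexity increase and requires evaluation runs, so the snippet below only illustrates the underlying quantity: the fraction of near-zero (weakly contributing) MLP activations under ReLU versus SiLU, using random pre-activations as a stand-in.

```python
import torch

# Illustrative comparison (not the paper's PPL-p% metric): fraction of MLP
# activations that are (nearly) zero under ReLU vs. SiLU.
torch.manual_seed(0)
pre_act = torch.randn(4096, 11008)          # toy pre-activations of one MLP layer

def sparsity(x: torch.Tensor, eps: float = 1e-3) -> float:
    """Fraction of entries whose magnitude is below eps (weakly contributing)."""
    return (x.abs() < eps).float().mean().item()

print(f"ReLU sparsity: {sparsity(torch.relu(pre_act)):.3f}")                 # ~0.5 on symmetric inputs
print(f"SiLU sparsity: {sparsity(torch.nn.functional.silu(pre_act)):.3f}")   # far lower: SiLU rarely outputs exact zeros
```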
arXiv Detail & Related papers (2024-11-04T17:59:04Z)
- Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs [77.66717051042032]
Practitioners have consistently observed three puzzling phenomena in transformer-based large language models.
These phenomena are characterized by certain so-called "sink tokens" receiving disproportionately high attention weights.
We elucidate the mechanisms behind extreme-token phenomena.
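A quick way to see the "sink token" effect on any Hugging Face causal LM is to average, per layer, the attention probability that later query positions assign to the first token. This is an illustrative probe rather than the paper's mechanistic analysis; the model choice is arbitrary.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Measure how much attention mass each layer puts on the first token (the usual sink).
name = "gpt2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, attn_implementation="eager")
model.eval()

ids = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    out = model(**ids, output_attentions=True)

for layer_idx, attn in enumerate(out.attentions):     # each: (1, heads, seq, seq)
    # Mean probability assigned to key position 0, averaged over heads and queries > 0
    # (query 0 is skipped because it can only attend to itself under the causal mask).
    sink_mass = attn[0, :, 1:, 0].mean().item()
    print(f"layer {layer_idx:2d}: mean attention on first token = {sink_mass:.3f}")
```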
arXiv Detail & Related papers (2024-10-17T17:54:06Z)
- ActiView: Evaluating Active Perception Ability for Multimodal Large Language Models [18.992215985625492]
We evaluate active perception in Multimodal Large Language Models (MLLMs).
We focus on a specialized form of Visual Question Answering (VQA) that eases evaluation yet remains challenging for existing MLLMs.
We observe that the ability to read and comprehend multiple images simultaneously plays a significant role in enabling active perception.
arXiv Detail & Related papers (2024-10-07T00:16:26Z)
- Not All Diffusion Model Activations Have Been Evaluated as Discriminative Features [115.33889811527533]
Diffusion models were initially designed for image generation.
Recent research shows that the internal signals within their backbones, named activations, can also serve as dense features for various discriminative tasks.
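The extraction pattern these works rely on amounts to collecting intermediate activations with forward hooks. The sketch below demonstrates it on a tiny stand-in convolutional backbone; with an actual diffusion model you would hook its U-Net or transformer blocks and feed a noised image plus timestep conditioning instead.

```python
import torch
import torch.nn as nn

# Generic sketch: register forward hooks on intermediate blocks of a backbone
# and keep the activations as dense features. The backbone here is a toy stand-in.
backbone = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.SiLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.SiLU(),
)

features = {}

def save_activation(name):
    def hook(module, inputs, output):
        features[name] = output.detach()   # store the block's output activation
    return hook

backbone[1].register_forward_hook(save_activation("block1"))
backbone[3].register_forward_hook(save_activation("block2"))

x = torch.randn(1, 3, 64, 64)              # stand-in input
_ = backbone(x)
for name, feat in features.items():
    print(name, tuple(feat.shape))         # dense feature maps usable for downstream tasks
```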
arXiv Detail & Related papers (2024-10-04T16:05:14Z)
- Are Bigger Encoders Always Better in Vision Large Models? [21.797332686137203]
Multimodal large language models (MLLMs) have shown strong potential in real-world applications.
The scaling trend of vision language models (VLMs) under the current mainstream paradigm has not been extensively studied.
We conduct experiments on the pretraining stage of MLLMs using different encoder sizes and large language model (LLM) sizes.
arXiv Detail & Related papers (2024-08-01T15:05:42Z)