Related papers: Value-State Gated Attention for Mitigating Extreme-Token Phenomena in Transformers

Value-State Gated Attention for Mitigating Extreme-Token Phenomena in Transformers

URL: http://arxiv.org/abs/2510.09017v1
Date: Fri, 10 Oct 2025 05:40:53 GMT
Title: Value-State Gated Attention for Mitigating Extreme-Token Phenomena in Transformers
Authors: Rui Bu, Haofeng Zhong, Wenzheng Chen, Yangyan Li,
Abstract summary: Large models based on the Transformer architecture are susceptible to extreme-token phenomena, such as attention sinks and value-state drains.<n>We propose Value-State Gated Attention (VGA), a simple, dedicated, and stable architectural mechanism for performing 'no-op' attention efficiently.<n>Our experiments demonstrate that VGA significantly mitigates the formation of attention sinks and stabilizes value-state norms, leading to improved performance, robust quantization fidelity, and enhanced model interpretability.
Score: 9.323230501418509
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large models based on the Transformer architecture are susceptible to extreme-token phenomena, such as attention sinks and value-state drains. These issues, which degrade model performance, quantization fidelity, and interpretability, arise from a problematic mutual reinforcement mechanism where the model learns an inefficient 'no-op' behavior by focusing attention on tokens with near-zero value states. In this paper, we propose Value-State Gated Attention (VGA), a simple, dedicated, and stable architectural mechanism for performing 'no-op' attention efficiently by directly breaking this cycle. VGA introduces a learnable, data-dependent gate, computed directly from the value vectors (V), to modulate the output. Through a theoretical analysis of the underlying gradients, we show that gating the value-state with a function of itself is more effective at decoupling value and attention score updates than prior methods that gate on input embeddings. This creates a direct regulatory pathway that allows the model to suppress a token's contribution based on its emergent value representation. Our experiments demonstrate that VGA significantly mitigates the formation of attention sinks and stabilizes value-state norms, leading to improved performance, robust quantization fidelity, and enhanced model interpretability.

Related papers

Understanding Degradation with Vision Language Model [56.09241449206817]
Understanding visual degradations is a critical yet challenging problem in computer vision.<n>We introduce DU-VLM, a multimodal chain-of-thought model trained with supervised fine-tuning and reinforcement learning.<n>We also introduce textbfDU-110k, a large-scale dataset comprising 110,000 clean-degraded pairs with grounded physical annotations.
arXiv Detail & Related papers (2026-02-04T13:51:15Z)
Adaptive Residual-Update Steering for Low-Overhead Hallucination Mitigation in Large Vision Language Models [13.32858759983739]
Large Vision-Language Models (LVLMs) often suffer from object hallucination, generating text inconsistent with visual inputs.<n>Existing inference-time interventions to mitigate this issue present a challenging trade-off.<n>We present Residual-Update Directed DEcoding Regulation (RUDDER), a framework that steers LVLMs towards visually-grounded generation.
arXiv Detail & Related papers (2025-11-13T13:29:38Z)
ScaleWeaver: Weaving Efficient Controllable T2I Generation with Multi-Scale Reference Attention [86.93601565563954]
ScaleWeaver is a framework designed to achieve high-fidelity, controllable generation upon advanced visual autoregressive( VAR) models.<n>The proposed Reference Attention module discards the unnecessary attention from image$rightarrow$condition, reducing computational cost.<n>Experiments show that ScaleWeaver delivers high-quality generation and precise control while attaining superior efficiency over diffusion-based methods.
arXiv Detail & Related papers (2025-10-16T17:00:59Z)
Transformer-Based Indirect Structural Health Monitoring of Rail Infrastructure with Attention-Driven Detection and Localization of Transient Defects [1.1782896991259]
We introduce an incremental synthetic data benchmark designed to evaluate model robustness against progressively complex challenges.<n>We evaluate several established unsupervised models alongside our proposed Attention-Focused Transformer.<n>Our proposed model achieves accuracy comparable to the state-of-the-art solution while demonstrating better inference speed.
arXiv Detail & Related papers (2025-10-08T23:01:53Z)
Transformers Learn Faster with Semantic Focus [57.97235825738412]
We study sparse transformers in terms of learnability and generalization.<n>We find that input-dependent sparse attention models appear to converge faster and generalize better than standard attention models.
arXiv Detail & Related papers (2025-06-17T01:19:28Z)
Does Acceleration Cause Hidden Instability in Vision Language Models? Uncovering Instance-Level Divergence Through a Large-Scale Empirical Study [44.170933007736984]
Vision-Language Models (VLMs) are powerful yet computationally intensive for widespread practical deployments.<n>Current acceleration evaluations primarily target minimal overall performance degradation, overlooking a crucial question: does the accelerated model still give the same answers to the same questions as it did before acceleration?<n>This is vital for stability-centered industrial applications where consistently correct answers for specific, known situations are paramount, such as in AI-based disease diagnosis.
arXiv Detail & Related papers (2025-03-09T22:16:48Z)
Transformer Meets Twicing: Harnessing Unattended Residual Information [2.1605931466490795]
Transformer-based deep learning models have achieved state-of-the-art performance across numerous language and vision tasks.<n>While the self-attention mechanism has proven capable of handling complex data patterns, it has been observed that the representational capacity of the attention matrix degrades significantly across transformer layers.<n>We propose the Twicing Attention, a novel attention mechanism that uses kernel twicing procedure in nonparametric regression to alleviate the low-pass behavior of associated NLM smoothing.
arXiv Detail & Related papers (2025-03-02T01:56:35Z)
Gradformer: Graph Transformer with Exponential Decay [69.50738015412189]
Self-attention mechanism in Graph Transformers (GTs) overlooks the graph's inductive biases, particularly biases related to structure. This paper presents Gradformer, a method innovatively integrating GT with the intrinsic inductive bias. Gradformer consistently outperforms the Graph Neural Network and GT baseline models in various graph classification and regression tasks.
arXiv Detail & Related papers (2024-04-24T08:37:13Z)
PIDformer: Transformer Meets Control Theory [28.10913642120948]
We unveil self-attention as an autonomous state-space model that inherently promotes smoothness in its solutions. We incorporate a Proportional-Integral-Derivative (PID) closed-loop feedback control system with a reference point into the model to improve robustness and representation capacity. Motivated by this control framework, we derive a novel class of transformers, PID-controlled Transformer (PIDformer)
arXiv Detail & Related papers (2024-02-25T05:04:51Z)
Disentanglement via Latent Quantization [60.37109712033694]
In this work, we construct an inductive bias towards encoding to and decoding from an organized latent space. We demonstrate the broad applicability of this approach by adding it to both basic data-re (vanilla autoencoder) and latent-reconstructing (InfoGAN) generative models.
arXiv Detail & Related papers (2023-05-28T06:30:29Z)
Towards Practical Control of Singular Values of Convolutional Layers [65.25070864775793]
Convolutional neural networks (CNNs) are easy to train, but their essential properties, such as generalization error and adversarial robustness, are hard to control. Recent research demonstrated that singular values of convolutional layers significantly affect such elusive properties. We offer a principled approach to alleviating constraints of the prior art at the expense of an insignificant reduction in layer expressivity.
arXiv Detail & Related papers (2022-11-24T19:09:44Z)
Attention that does not Explain Away [54.42960937271612]
Models based on the Transformer architecture have achieved better accuracy than the ones based on competing architectures for a large set of tasks. A unique feature of the Transformer is its universal application of a self-attention mechanism, which allows for free information flow at arbitrary distances. We propose a doubly-normalized attention scheme that is simple to implement and provides theoretical guarantees for avoiding the "explaining away" effect.
arXiv Detail & Related papers (2020-09-29T21:05:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.