Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse
- URL: http://arxiv.org/abs/2602.01203v1
- Date: Sun, 01 Feb 2026 12:45:39 GMT
- Title: Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse
- Authors: Zizhuo Fu, Wenxuan Zeng, Runsheng Wang, Meng Li
- Abstract summary: We show that the sink in Vanilla Attention and Sink Attention naturally constructs a Mixture-of-Experts (MoE) mechanism within attention layers. To mitigate head collapse, we propose a sink-aware training algorithm with an auxiliary load balancing loss designed for attention layers.
- Score: 11.042559854770422
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) often assign disproportionate attention to the first token, a phenomenon known as the attention sink. Several recent approaches aim to address this issue, including Sink Attention in GPT-OSS and Gated Attention in Qwen3-Next. However, a comprehensive analysis of the relationship among these attention mechanisms is lacking. In this work, we provide both theoretical and empirical evidence demonstrating that the sink in Vanilla Attention and Sink Attention naturally constructs a Mixture-of-Experts (MoE) mechanism within attention layers. This insight explains the head collapse phenomenon observed in prior work, where only a fixed subset of attention heads contributes to generation. To mitigate head collapse, we propose a sink-aware training algorithm with an auxiliary load balancing loss designed for attention layers. Extensive experiments show that our method achieves effective head load balancing and improves model performance across Vanilla Attention, Sink Attention, and Gated Attention. We hope this study offers a new perspective on attention mechanisms and encourages further exploration of the inherent MoE structure within attention layers.
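The abstract only describes the auxiliary load balancing loss at a high level. The sketch below is an illustration of what an MoE-style load-balancing term over attention heads could look like, not the authors' actual formulation: the function name `head_load_balancing_loss`, the use of non-sink attention mass as a head's "load", and the squared-fraction penalty are all assumptions made for the example.

```python
import torch

def head_load_balancing_loss(attn_weights: torch.Tensor, alpha: float = 0.01) -> torch.Tensor:
    """Hypothetical MoE-style load-balancing auxiliary loss over attention heads.

    attn_weights: (batch, num_heads, query_len, key_len) softmax attention maps,
    where key position 0 is assumed to be the sink token. Treating each head as
    an "expert", a head's load is the attention mass it routes away from the sink.
    """
    # Non-sink mass per head: 1 minus the attention paid to the first (sink) token.
    non_sink_mass = 1.0 - attn_weights[..., 0]      # (batch, num_heads, query_len)
    load = non_sink_mass.mean(dim=(0, 2))           # average load per head, shape (num_heads,)

    # Normalise loads into a distribution over heads and penalise imbalance,
    # analogous to router load-balancing losses in sparse MoE models.
    frac = load / (load.sum() + 1e-9)
    num_heads = frac.numel()
    balance_penalty = num_heads * (frac ** 2).sum() # equals 1.0 when loads are perfectly uniform
    return alpha * balance_penalty
```

In training, such a term would simply be added to the language-modeling objective, e.g. `loss = lm_loss + head_load_balancing_loss(attn_weights)`, mirroring how load-balancing losses are attached to routers in sparse MoE models.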
Related papers
- Revealing the Attention Floating Mechanism in Masked Diffusion Models [52.74142815156738]
Masked diffusion models (MDMs) leverage bidirectional attention and a denoising process. This paper investigates the attention behaviors in MDMs, revealing the phenomenon of Attention Floating.
arXiv Detail & Related papers (2026-01-12T09:10:05Z) - Attention Needs to Focus: A Unified Perspective on Attention Allocation [37.34801068995858]
The Transformer architecture is a cornerstone of modern Large Language Models (LLMs). The standard attention mechanism is plagued by well-documented issues: representational collapse and attention sink. We present a unified perspective, arguing that both can be traced to a common root -- improper attention allocation.
arXiv Detail & Related papers (2026-01-01T08:39:15Z) - Reversed Attention: On The Gradient Descent Of Attention Layers In GPT [55.2480439325792]
We study the mathematics of the backward pass of attention, revealing that it implicitly calculates an attention matrix we refer to as "Reversed Attention". In an experimental setup, we showcase the ability of Reversed Attention to directly alter the forward pass of attention, without modifying the model's weights. In addition to enhancing the comprehension of how LMs configure attention layers during backpropagation, Reversed Attention maps contribute to a more interpretable backward pass.
arXiv Detail & Related papers (2024-12-22T13:48:04Z) - Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs [77.66717051042032]
Practitioners have consistently observed three puzzling phenomena in transformer-based large language models.
These phenomena are characterized by certain so-called "sink tokens" receiving disproportionately high attention weights.
We elucidate the mechanisms behind extreme-token phenomena.
arXiv Detail & Related papers (2024-10-17T17:54:06Z) - Attention mechanisms for physiological signal deep learning: which attention should we take? [0.0]
We experimentally analyze four attention mechanisms (squeeze-and-excitation, non-local, convolutional block attention module, and multi-head self-attention) and three convolutional neural network (CNN) architectures.
We evaluate multiple combinations for the performance and convergence of physiological signal deep learning models.
arXiv Detail & Related papers (2022-07-04T07:24:08Z) - Guiding Visual Question Answering with Attention Priors [76.21671164766073]
We propose to guide the attention mechanism using explicit linguistic-visual grounding.
This grounding is derived by connecting structured linguistic concepts in the query to their referents among the visual objects.
The resultant algorithm is capable of probing attention-based reasoning models, injecting relevant associative knowledge, and regulating the core reasoning process.
arXiv Detail & Related papers (2022-05-25T09:53:47Z) - More Than Just Attention: Learning Cross-Modal Attentions with Contrastive Constraints [63.08768589044052]
We propose Contrastive Content Re-sourcing (CCR) and Contrastive Content Swapping (CCS) constraints to address this limitation.
CCR and CCS constraints supervise the training of attention models in a contrastive learning manner without requiring explicit attention annotations.
Experiments on both Flickr30k and MS-COCO datasets demonstrate that integrating these attention constraints into two state-of-the-art attention-based models improves the model performance.
arXiv Detail & Related papers (2021-05-20T08:48:10Z) - Attention in Attention Network for Image Super-Resolution [18.2279472158217]
We quantify and visualize the static attention mechanisms and show that not all attention modules are equally beneficial.
We propose an attention-in-attention network (A$^2$N) for highly accurate image SR.
Our model achieves superior trade-off performance compared with state-of-the-art lightweight networks.
arXiv Detail & Related papers (2021-04-19T17:59:06Z) - Attention Meets Perturbations: Robust and Interpretable Attention with Adversarial Training [7.106986689736828]
We propose a general training technique for natural language processing tasks, including AT for attention (Attention AT) and a more interpretable AT for attention (Attention iAT).
The proposed techniques improved the prediction performance and the model interpretability by exploiting the mechanisms with AT.
arXiv Detail & Related papers (2020-09-25T07:26:45Z) - Repulsive Attention: Rethinking Multi-head Attention as Bayesian Inference [68.12511526813991]
We provide a novel understanding of multi-head attention from a Bayesian perspective.
We propose a non-parametric approach that explicitly improves the repulsiveness in multi-head attention.
Experiments on various attention models and applications demonstrate that the proposed repulsive attention can improve the learned feature diversity.
arXiv Detail & Related papers (2020-09-20T06:32:23Z)