Attention Needs to Focus: A Unified Perspective on Attention Allocation
- URL: http://arxiv.org/abs/2601.00919v2
- Date: Wed, 07 Jan 2026 18:20:49 GMT
- Title: Attention Needs to Focus: A Unified Perspective on Attention Allocation
- Authors: Zichuan Fu, Wentao Song, Guojing Li, Yejing Wang, Xian Wu, Yimin Deng, Hanyu Yan, Yefeng Zheng, Xiangyu Zhao
- Abstract summary: The Transformer architecture is a cornerstone of modern Large Language Models (LLMs). The standard attention mechanism is plagued by two well-documented issues: representational collapse and attention sink. We present a unified perspective, arguing that both can be traced to a common root: improper attention allocation.
- Score: 37.34801068995858
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Transformer architecture, a cornerstone of modern Large Language Models (LLMs), has achieved extraordinary success in sequence modeling, primarily due to its attention mechanism. However, despite its power, the standard attention mechanism is plagued by well-documented issues: representational collapse and attention sink. Although prior work has proposed approaches to these issues, they are often studied in isolation, obscuring their deeper connection. In this paper, we present a unified perspective, arguing that both can be traced to a common root -- improper attention allocation. We identify two failure modes: 1) Attention Overload, where many tokens receive comparably high weights, blurring semantic features and leading to representational collapse; 2) Attention Underload, where no token is semantically relevant, yet attention is still forced to distribute, resulting in spurious focus such as attention sink. Building on this insight, we introduce Lazy Attention, a novel mechanism designed for a more focused attention distribution. To mitigate overload, it employs positional discrimination across both heads and dimensions to sharpen token distinctions. To counteract underload, it incorporates Elastic-Softmax, a modified normalization function that relaxes the standard softmax constraint to suppress attention on irrelevant tokens. Experiments on the FineWeb-Edu corpus, evaluated across nine diverse benchmarks, demonstrate that Lazy Attention successfully mitigates attention sink and achieves competitive performance compared to both standard attention and modern architectures, while reaching up to 59.58% attention sparsity.
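The abstract does not give the exact form of Elastic-Softmax, so the sketch below only illustrates the general idea it describes: relaxing the softmax sum-to-one constraint so attention can shrink toward zero when no key is relevant. The function name and the `slack` parameter are illustrative assumptions (in the style of "softmax-off-by-one"), not the paper's definition.

```python
import numpy as np

def elastic_softmax(scores, slack=1.0):
    """Softmax with a relaxed sum-to-one constraint (illustrative sketch).

    Adding a constant `slack` term to the denominator lets the weights sum
    to less than 1, so attention can "opt out" when every key scores low,
    instead of being forced to distribute mass onto a spurious sink token.
    With slack=0 this reduces to the standard softmax. Assumes scores are
    on a moderate scale (no max-subtraction, to keep the slack term
    interpretable in this small demo).
    """
    exp = np.exp(scores)
    return exp / (exp.sum(axis=-1, keepdims=True) + slack)

# Uniformly low (irrelevant) scores: total attention mass shrinks.
low = np.array([-4.0, -4.0, -4.0, -4.0])
# One clearly relevant key: behaves almost like ordinary softmax.
high = np.array([6.0, -4.0, -4.0, -4.0])
print(elastic_softmax(low).sum())   # well below 1
print(elastic_softmax(high).sum())  # close to 1
```

The design point is that the standard softmax forces every row of attention weights to sum to exactly 1, so "underloaded" queries must put their mass somewhere; any relaxation of that constraint gives the model a way to abstain.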
Related papers
- Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse [11.042559854770422]
We show that the sinks in Vanilla Attention and Sink Attention naturally construct a Mixture-of-Experts (MoE) mechanism within attention layers. To mitigate head collapse, we propose a sink-aware training algorithm with an auxiliary load-balancing loss designed for attention layers.
arXiv Detail & Related papers (2026-02-01T12:45:39Z) - Rectified SpaAttn: Revisiting Attention Sparsity for Efficient Video Generation [22.35209793690791]
Diffusion Transformers dominate video generation, but the quadratic complexity of attention introduces substantial latency. Attention sparsity reduces computational cost by focusing on critical tokens while ignoring non-critical ones. Existing methods induce systematic biases in attention allocation. We propose Rectified SpaAttn, which rectifies attention allocation with an implicit full-attention reference.
arXiv Detail & Related papers (2025-11-25T02:03:54Z) - Scaling Stick-Breaking Attention: An Efficient Implementation and In-depth Study [38.492552119793]
We investigate an alternative attention mechanism based on the stick-breaking process in larger-scale settings. We study the implications of replacing the conventional softmax-based attention mechanism with stick-breaking attention. When used as a drop-in replacement for current softmax+RoPE attention systems, we find that stick-breaking attention performs competitively with current methods.
arXiv Detail & Related papers (2024-10-23T15:51:13Z) - When Attention Sink Emerges in Language Models: An Empirical View [39.36282162213973]
Language Models (LMs) assign significant attention to the first token, even when it is not semantically important. This phenomenon has been widely exploited in applications such as streaming/long-context generation, KV cache optimization, inference acceleration, and model quantization. We first demonstrate that attention sinks exist universally in LMs across various inputs, even in small models.
arXiv Detail & Related papers (2024-10-14T17:50:28Z) - Elliptical Attention [1.7597562616011944]
Pairwise dot-product self-attention is key to the success of transformers that achieve state-of-the-art performance across a variety of applications in language and vision.
We propose using a Mahalanobis distance metric for computing the attention weights to stretch the underlying feature space in directions of high contextual relevance.
arXiv Detail & Related papers (2024-06-19T18:38:11Z) - Guiding Visual Question Answering with Attention Priors [76.21671164766073]
We propose to guide the attention mechanism using explicit linguistic-visual grounding.
This grounding is derived by connecting structured linguistic concepts in the query to their referents among the visual objects.
The resultant algorithm is capable of probing attention-based reasoning models, injecting relevant associative knowledge, and regulating the core reasoning process.
arXiv Detail & Related papers (2022-05-25T09:53:47Z) - Alignment Attention by Matching Key and Query Distributions [48.93793773929006]
This paper introduces alignment attention that explicitly encourages self-attention to match the distributions of the key and query within each head.
It is simple to convert any models with self-attention, including pre-trained ones, to the proposed alignment attention.
On a variety of language understanding tasks, we show the effectiveness of our method in accuracy, uncertainty estimation, generalization across domains, and robustness to adversarial attacks.
arXiv Detail & Related papers (2021-10-25T00:54:57Z) - More Than Just Attention: Learning Cross-Modal Attentions with Contrastive Constraints [63.08768589044052]
We propose Contrastive Content Re-sourcing (CCR) and Contrastive Content Swapping (CCS) constraints to address this limitation.
CCR and CCS constraints supervise the training of attention models in a contrastive learning manner without requiring explicit attention annotations.
Experiments on both Flickr30k and MS-COCO datasets demonstrate that integrating these attention constraints into two state-of-the-art attention-based models improves the model performance.
arXiv Detail & Related papers (2021-05-20T08:48:10Z) - Causal Attention for Vision-Language Tasks [142.82608295995652]
We present a novel attention mechanism: Causal Attention (CATT).
CATT removes the ever-elusive confounding effect in existing attention-based vision-language models.
In particular, we show that CATT has great potential in large-scale pre-training.
arXiv Detail & Related papers (2021-03-05T06:38:25Z) - Exploring Self-attention for Image Recognition [151.12000247183636]
We consider two forms of self-attention for image recognition.
One is pairwise self-attention, which generalizes standard dot-product attention.
The other is patchwise self-attention, which is strictly more powerful than convolution.
arXiv Detail & Related papers (2020-04-28T16:01:48Z)
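All of the papers above modify the same baseline: standard scaled dot-product self-attention, in which the softmax forces each query's weights to sum to exactly 1 (the constraint that Elastic-Softmax relaxes and that produces attention sinks for "underloaded" queries). A minimal NumPy sketch of that baseline, written for illustration and not taken from any of the papers:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def dot_product_attention(Q, K, V, causal=True):
    """Standard scaled dot-product self-attention over one sequence."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)              # (n, n) pairwise similarities
    if causal:
        n = scores.shape[0]
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)  # block attention to future tokens
    weights = softmax(scores)                  # each row sums to 1: mass must go somewhere
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d = 5, 8
Q, K, V = rng.normal(size=(3, n, d))
out, w = dot_product_attention(Q, K, V)
print(w.sum(axis=-1))  # every row sums to 1 -- the constraint at issue
```

Because each row of `w` is a full probability distribution, a query with no relevant key still spends its entire attention budget, which is exactly the "underload" failure mode the main paper targets.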
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.