Rectifying Magnitude Neglect in Linear Attention
- URL: http://arxiv.org/abs/2507.00698v2
- Date: Thu, 24 Jul 2025 04:37:55 GMT
- Title: Rectifying Magnitude Neglect in Linear Attention
- Authors: Qihang Fan, Huaibo Huang, Yuang Ai, Ran He
- Abstract summary: Linear Attention suffers from a significant performance degradation compared to standard Softmax Attention. We propose Magnitude-Aware Linear Attention (MALA), which modifies the computation of Linear Attention to fully incorporate the Query's magnitude.
- Score: 57.097694292570885
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As the core operator of Transformers, Softmax Attention exhibits excellent global modeling capabilities. However, its quadratic complexity limits its applicability to vision tasks. In contrast, Linear Attention shares a similar formulation with Softmax Attention while achieving linear complexity, enabling efficient global information modeling. Nevertheless, Linear Attention suffers from a significant performance degradation compared to standard Softmax Attention. In this paper, we analyze the underlying causes of this issue based on the formulation of Linear Attention. We find that, unlike Softmax Attention, Linear Attention entirely disregards the magnitude information of the Query. This prevents the attention score distribution from dynamically adapting as the Query scales. As a result, despite its structural similarity to Softmax Attention, Linear Attention exhibits a significantly different attention score distribution. Based on this observation, we propose Magnitude-Aware Linear Attention (MALA), which modifies the computation of Linear Attention to fully incorporate the Query's magnitude. This adjustment allows MALA to generate an attention score distribution that closely resembles Softmax Attention while exhibiting a more well-balanced structure. We evaluate the effectiveness of MALA on multiple tasks, including image classification, object detection, instance segmentation, semantic segmentation, natural language processing, speech recognition, and image generation. Our MALA achieves strong results on all of these tasks. Code will be available at https://github.com/qhfan/MALA
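The magnitude-neglect claim in the abstract can be checked numerically. The sketch below is an illustration of the underlying observation, not the paper's MALA implementation: it assumes a ReLU feature map (one common choice for linear attention) and shows that scaling a query sharpens softmax attention weights, while the ReLU-kernel linear attention weights never change, because the positive scale factor cancels in the normalization.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 8                       # head dim, number of keys
q = rng.normal(size=d)
K = rng.normal(size=(n, d))

def softmax_weights(q, K):
    s = K @ q
    e = np.exp(s - s.max())        # numerically stable softmax over keys
    return e / e.sum()

def linear_weights(q, K, phi=lambda x: np.maximum(x, 0)):
    # The ReLU feature map is an assumption; any positively homogeneous
    # phi exhibits the same magnitude-invariance.
    s = phi(K) @ phi(q)
    return s / s.sum()

for c in (1.0, 2.0, 8.0):          # scale the query's magnitude
    print(f"c={c:>3}: softmax max w = {softmax_weights(c * q, K).max():.3f}, "
          f"linear max w = {linear_weights(c * q, K).max():.3f}")
# Softmax weights sharpen as c grows; the linear weights are identical for
# every c, since phi(c*q) = c*phi(q) and c cancels in the ratio.
```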
Related papers
- On the Expressiveness of Softmax Attention: A Recurrent Neural Network Perspective [3.1044138971639743]
The main drawback of softmax attention is its quadratic memory requirement and computational complexity with respect to the sequence length. By replacing the softmax nonlinearity, linear attention and similar methods have been introduced to avoid this quadratic bottleneck. This work demonstrates that linear attention is an approximation of softmax attention by deriving the recurrent form of softmax attention.
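The recurrent view referenced here is easiest to see on the linear-attention side. The sketch below follows the standard causal linear attention recurrence (in the style of Katharopoulos et al., with the usual elu(x)+1 feature map; both choices are assumptions, not details taken from this summary):

```python
import numpy as np

def phi(x):
    # elu(x) + 1, the positive feature map popularized by Katharopoulos et al.
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_linear_attention(Q, K, V):
    """O(n * d^2) recurrence instead of the O(n^2) attention matrix."""
    n, d = Q.shape
    S = np.zeros((d, V.shape[1]))   # running sum of phi(k_t) v_t^T
    z = np.zeros(d)                 # running sum of phi(k_t)
    out = np.empty_like(V)
    for t in range(n):
        qt, kt = phi(Q[t]), phi(K[t])
        S += np.outer(kt, V[t])
        z += kt
        out[t] = (qt @ S) / (qt @ z)  # normalized, like a softmax row
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(10, 4)) for _ in range(3))
print(causal_linear_attention(Q, K, V).shape)  # (10, 4)
```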
arXiv Detail & Related papers (2025-07-31T15:10:03Z) - SEMA: a Scalable and Efficient Mamba like Attention via Token Localization and Averaging [6.141765857284547]
We formulate both vanilla softmax attention and linear attention within a general framework. We show that generalized attention disperses: as the number of keys tends to infinity, the query assigns equal weights to all keys. We show that Scalable and Efficient Mamba-like Attention (SEMA) is a scalable and effective alternative beyond linear attention, outperforming recent vision Mamba models on increasingly larger image scales.
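As an illustrative (and hypothetical) experiment of the dispersion claim, the entropy of an attention row over i.i.d. random keys approaches its maximum log n as n grows, i.e. the weights flatten toward uniform, for softmax and a ReLU-kernel linear variant alike:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
q = rng.normal(size=d)

def entropy(w):
    w = w[w > 0]
    return -(w * np.log(w)).sum()

for n in (16, 256, 4096, 65536):
    K = rng.normal(size=(n, d)) / np.sqrt(d)
    s = K @ q                              # softmax scores
    e = np.exp(s - s.max())
    w_soft = e / e.sum()
    r = np.maximum(K, 0) @ np.maximum(q, 0)  # ReLU-kernel linear scores
    w_lin = r / r.sum()
    print(f"n={n:>6}: H/log n  softmax={entropy(w_soft)/np.log(n):.3f}  "
          f"linear={entropy(w_lin)/np.log(n):.3f}")
# Both ratios approach 1: with ever more random keys, the weight
# distribution flattens toward uniform, i.e. the attention disperses.
```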
arXiv Detail & Related papers (2025-06-10T00:03:19Z) - Breaking Complexity Barriers: High-Resolution Image Restoration with Rank Enhanced Linear Attention [54.42902794496325]
Linear attention, a variant of softmax attention, demonstrates promise in global context modeling. We propose Rank Enhanced Linear Attention (RELA), a simple yet effective method that enriches feature representations by integrating a lightweight depthwise convolution. Building upon RELA, we propose an efficient and effective image restoration Transformer named LAformer.
arXiv Detail & Related papers (2025-05-22T02:57:23Z) - Bridging the Divide: Reconsidering Softmax and Linear Attention [116.34723260730405]
We present two key perspectives to understand and alleviate the limitations of linear attention. First, we prove that linear attention is not injective, making it prone to assigning identical attention weights to different query vectors. Second, we confirm that effective local modeling is essential for the success of Softmax attention, an area in which linear attention falls short.
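The non-injectivity result is simple to reproduce for a concrete kernel. In this hypothetical example with a ReLU feature map, two distinct queries collapse to the same feature vector and therefore receive exactly the same attention distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
K = rng.normal(size=(6, 2))
phi = lambda x: np.maximum(x, 0)   # ReLU feature map (an assumed choice)

def linear_weights(q):
    s = phi(K) @ phi(q)
    return s / s.sum()

q1 = np.array([1.0, -2.0])
q2 = np.array([1.0, -5.0])         # a clearly different query...
print(np.allclose(linear_weights(q1), linear_weights(q2)))  # True:
# ReLU zeroes the negative coordinate, so phi(q1) == phi(q2) == [1, 0]
# and both queries receive exactly the same attention weights.
```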
arXiv Detail & Related papers (2024-12-09T15:44:22Z) - Breaking the Low-Rank Dilemma of Linear Attention [61.55583836370135]
Linear attention provides a far more efficient solution by reducing the complexity to linear levels. Our experiments indicate that its performance drop relative to Softmax attention stems from the low-rank nature of linear attention's feature map. We introduce Rank-Augmented Linear Attention (RALA), which rivals the performance of Softmax attention while maintaining linear complexity and high efficiency.
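The low-rank limitation follows directly from the factorized form: linear attention scores are phi(Q)phi(K)^T, whose rank is capped by the feature dimension, whereas the elementwise exponential in softmax attention typically restores full rank. A quick check, with a ReLU feature map as an assumed stand-in:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 64, 8
Q, K = rng.normal(size=(n, d)), rng.normal(size=(n, d))

A_lin = np.maximum(Q, 0) @ np.maximum(K, 0).T   # linear attention scores
A_soft = np.exp(Q @ K.T)                        # unnormalized softmax scores

print(np.linalg.matrix_rank(A_lin))   # <= d = 8: capped by the factorization
print(np.linalg.matrix_rank(A_soft))  # typically n = 64: exp lifts the rank
```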
arXiv Detail & Related papers (2024-11-12T08:30:59Z) - FLatten Transformer: Vision Transformer using Focused Linear Attention [80.61335173752146]
Linear attention offers a much more efficient alternative with its linear complexity.
Current linear attention approaches either suffer from significant performance degradation or introduce additional computation overhead.
We propose a novel Focused Linear Attention module to achieve both high efficiency and expressiveness.
arXiv Detail & Related papers (2023-08-01T10:37:12Z) - Linear Video Transformer with Feature Fixation [34.324346469406926]
Vision Transformers have achieved impressive performance in video classification, while suffering from the quadratic complexity caused by the Softmax attention mechanism.
We propose a feature fixation module to reweight the feature importance of the query and key before computing linear attention.
We achieve state-of-the-art performance among linear video Transformers on three popular video classification benchmarks.
arXiv Detail & Related papers (2022-10-15T02:20:50Z) - cosFormer: Rethinking Softmax in Attention [60.557869510885205]
Kernel methods are often adopted to reduce the complexity by approximating the softmax operator.
Due to approximation errors, their performance varies across tasks and corpora and can drop substantially.
We propose a linear transformer called cosFormer that can achieve comparable or better accuracy to the vanilla transformer.
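For context, the cos-based re-weighting described in the cosFormer paper linearizes through the cosine addition formula; the sketch below verifies that identity numerically (a simplified reconstruction from the published formulation, with sizes chosen arbitrarily, not code from this summary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 32, 8
M = n                                        # sequence-length scale
Q = np.maximum(rng.normal(size=(n, d)), 0)   # ReLU'd queries
K = np.maximum(rng.normal(size=(n, d)), 0)   # ReLU'd keys

# Direct cos-reweighted scores: s_ij = (q_i . k_j) * cos(pi*(i-j)/(2M))
i = np.arange(n)
direct = (Q @ K.T) * np.cos(np.pi * (i[:, None] - i[None, :]) / (2 * M))

# Linear form: cos(a-b) = cos(a)cos(b) + sin(a)sin(b) splits the weighting
# into per-position factors, so the scores factor into two linear products.
c, s = np.cos(np.pi * i / (2 * M)), np.sin(np.pi * i / (2 * M))
linear = (Q * c[:, None]) @ (K * c[:, None]).T \
       + (Q * s[:, None]) @ (K * s[:, None]).T

print(np.allclose(direct, linear))  # True: same scores, linear complexity
```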
arXiv Detail & Related papers (2022-02-17T17:53:48Z)