PolaFormer: Polarity-aware Linear Attention for Vision Transformers
- URL: http://arxiv.org/abs/2501.15061v2
- Date: Tue, 04 Mar 2025 07:00:07 GMT
- Title: PolaFormer: Polarity-aware Linear Attention for Vision Transformers
- Authors: Weikang Meng, Yadan Luo, Xin Li, Dongmei Jiang, Zheng Zhang,
- Abstract summary: Linear attention has emerged as a promising alternative to softmax-based attention. We propose a polarity-aware linear attention mechanism that explicitly models both same-signed and opposite-signed query-key interactions. For simplicity, and recognizing the distinct contributions of each dimension, we employ a learnable power function for rescaling.
- Score: 16.35834984488344
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Linear attention has emerged as a promising alternative to softmax-based attention, leveraging kernelized feature maps to reduce complexity from quadratic to linear in sequence length. However, the non-negative constraint on feature maps and the relaxed exponential function used in approximation lead to significant information loss compared to the original query-key dot products, resulting in less discriminative attention maps with higher entropy. To address the missing interactions driven by negative values in query-key pairs, we propose a polarity-aware linear attention mechanism that explicitly models both same-signed and opposite-signed query-key interactions, ensuring comprehensive coverage of relational information. Furthermore, to restore the spiky properties of attention maps, we provide a theoretical analysis proving the existence of a class of element-wise functions (with positive first and second derivatives) that can reduce entropy in the attention distribution. For simplicity, and recognizing the distinct contributions of each dimension, we employ a learnable power function for rescaling, allowing strong and weak attention signals to be effectively separated. Extensive experiments demonstrate that the proposed PolaFormer improves performance on various vision tasks, enhancing both expressiveness and efficiency by up to 4.6%.
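To make the mechanism concrete, the sketch below splits queries and keys into their positive and negative parts, accumulates the same-signed (q+k+ and q-k-) and opposite-signed (q+k- and q-k+) interactions in linear time, and applies a learnable element-wise power to restore spiky attention. This is a minimal, illustrative reimplementation based only on the abstract, not the authors' released code; the module name, the learnable exponent `alpha`, and the choice to route the two interaction groups through separate halves of the value channels are assumptions made for brevity.

```python
import torch
import torch.nn as nn


class PolarityAwareLinearAttention(nn.Module):
    """Illustrative sketch of polarity-aware linear attention (not the official code)."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        assert dim % num_heads == 0 and (dim // num_heads) % 2 == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Learnable per-dimension exponent: a power > 1 (positive first and
        # second derivatives) sharpens the attention distribution.
        self.alpha = nn.Parameter(torch.full((1, num_heads, 1, self.head_dim), 3.0))

    def _feature_map(self, x):
        # x is non-negative here; element-wise learnable power rescaling.
        return x ** torch.clamp(self.alpha, min=1.0)

    @staticmethod
    def _linear_attn(qf, kf, v):
        # Contract keys with values first: O(N * d^2) instead of O(N^2 * d).
        kv = torch.einsum('bhnd,bhne->bhde', kf, v)
        norm = 1.0 / (torch.einsum('bhnd,bhd->bhn', qf, kf.sum(dim=2)) + 1e-6)
        return torch.einsum('bhnd,bhde,bhn->bhne', qf, kv, norm)

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                     # each: B, H, N, d

        # Polarity decomposition: q = q_pos - q_neg, k = k_pos - k_neg.
        q_pos, q_neg = self._feature_map(q.relu()), self._feature_map((-q).relu())
        k_pos, k_neg = self._feature_map(k.relu()), self._feature_map((-k).relu())

        # Same-signed and opposite-signed interactions, routed through separate
        # halves of the value channels (an assumption of this sketch).
        v_same, v_opp = v.chunk(2, dim=-1)
        same = self._linear_attn(q_pos, k_pos, v_same) + self._linear_attn(q_neg, k_neg, v_same)
        opp = self._linear_attn(q_pos, k_neg, v_opp) + self._linear_attn(q_neg, k_pos, v_opp)

        out = torch.cat([same, opp], dim=-1).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


# Shape check: a 14x14 patch grid of 192-dimensional tokens.
attn = PolarityAwareLinearAttention(dim=192, num_heads=4)
print(attn(torch.randn(2, 196, 192)).shape)   # torch.Size([2, 196, 192])
```

Because keys are contracted with values before the queries are applied, each head costs O(N·d²) rather than the O(N²·d) of softmax attention, which is the source of the linear scaling in sequence length.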
Related papers
- NaLaFormer: Norm-Aware Linear Attention for Transformer Models [39.97155378043193]
We propose a novel Norm-Aware Linear Attention mechanism to restore norm-guided dynamic spikiness and recover kernel-perturbed norm distributions. We conduct extensive experiments demonstrating that the NaLaFormer improves performance on vision and language tasks, enhancing both expressiveness and efficiency by up to 4.2%.
arXiv Detail & Related papers (2025-06-26T10:47:39Z) - Transformers Learn Faster with Semantic Focus [57.97235825738412]
We study sparse transformers in terms of learnability and generalization. We find that input-dependent sparse attention models appear to converge faster and generalize better than standard attention models.
arXiv Detail & Related papers (2025-06-17T01:19:28Z) - Focus What Matters: Matchability-Based Reweighting for Local Feature Matching [6.361840891399624]
We propose a novel attention reweighting mechanism that incorporates a learnable bias term into the attention logits. Experiments conducted on three benchmark datasets validate the effectiveness of our method.
arXiv Detail & Related papers (2025-05-04T15:50:28Z) - ESPFormer: Doubly-Stochastic Attention with Expected Sliced Transport Plans [13.695885742446027]
Self-attention can lead to over-concentration on a few tokens during training, resulting in suboptimal information flow.
We introduce a novel, fully parallelizable doubly-stochastic attention mechanism based on sliced optimal transport.
Our method enforces double stochasticity without iterative Sinkhorn normalization, significantly enhancing efficiency.
arXiv Detail & Related papers (2025-02-11T21:20:48Z) - Bridging the Divide: Reconsidering Softmax and Linear Attention [116.34723260730405]
We present two key perspectives to understand and alleviate the limitations of linear attention. First, we prove that linear attention is not injective, which makes it prone to assigning identical attention weights to different query vectors. Second, we confirm that effective local modeling is essential for the success of Softmax attention, an area where linear attention falls short.
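The non-injectivity claim is easy to see with a toy example: under a common ReLU-style feature map, two different query vectors collapse to the same feature vector and therefore receive identical attention weights. The snippet below is only an illustration of that observation, assuming a ReLU kernel; it is not the construction used in the paper.

```python
# Toy illustration (assumed ReLU feature map, not the paper's construction):
# two distinct queries collapse to the same feature vector, so attention of
# the form phi(q)^T phi(k) assigns them identical weights.
import torch

def phi(x):                      # a typical non-negative feature map for linear attention
    return torch.relu(x)

q1 = torch.tensor([1.0, -2.0])
q2 = torch.tensor([1.0, -5.0])   # differs from q1 only in its negative entry
keys = torch.randn(4, 2)

w1 = phi(q1) @ phi(keys).T       # attention logits for q1
w2 = phi(q2) @ phi(keys).T       # identical logits for q2
print(torch.allclose(w1, w2))    # True: the feature map is not injective
```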
arXiv Detail & Related papers (2024-12-09T15:44:22Z) - DAPE V2: Process Attention Score as Feature Map for Length Extrapolation [63.87956583202729]
We conceptualize attention as a feature map and apply the convolution operator to mimic the processing methods in computer vision.
This insight, which can be adapted to various attention-related models, suggests that the current Transformer architecture has the potential for further evolution.
arXiv Detail & Related papers (2024-10-07T07:21:49Z) - FAST: Factorizable Attention for Speeding up Transformers [1.3637227185793512]
We present a linearly scaled attention mechanism that maintains the full representation of the attention matrix without compromising on sparsification.
Results indicate that our attention mechanism has a robust performance and holds significant promise for diverse applications where self-attention is used.
arXiv Detail & Related papers (2024-02-12T18:59:39Z) - FLatten Transformer: Vision Transformer using Focused Linear Attention [80.61335173752146]
Linear attention offers a much more efficient alternative with its linear complexity.
Current linear attention approaches either suffer from significant performance degradation or introduce additional computation overhead.
We propose a novel Focused Linear Attention module to achieve both high efficiency and expressiveness.
arXiv Detail & Related papers (2023-08-01T10:37:12Z) - Linear Self-Attention Approximation via Trainable Feedforward Kernel [77.34726150561087]
In pursuit of faster computation, Efficient Transformers demonstrate an impressive variety of approaches.
We aim to expand the idea of trainable kernel methods to approximate the self-attention mechanism of the Transformer architecture.
arXiv Detail & Related papers (2022-11-08T08:14:11Z) - Linear Video Transformer with Feature Fixation [34.324346469406926]
Vision Transformers have achieved impressive performance in video classification, while suffering from the quadratic complexity caused by the Softmax attention mechanism.
We propose a feature fixation module to reweight the feature importance of the query and key before computing linear attention.
We achieve state-of-the-art performance among linear video Transformers on three popular video classification benchmarks.
arXiv Detail & Related papers (2022-10-15T02:20:50Z) - Contrastive Learning Can Find An Optimal Basis For Approximately View-Invariant Functions [18.440569330385323]
We show that multiple contrastive learning methods can be reinterpreted as learning kernel functions that approximate a fixed positive-pair kernel.
We prove that a simple representation obtained by combining this kernel with PCA minimizes the worst-case approximation error of linear predictors.
arXiv Detail & Related papers (2022-10-04T20:02:52Z) - Unraveling Attention via Convex Duality: Analysis and Interpretations of Vision Transformers [52.468311268601056]
This paper analyzes attention through the lens of convex duality.
We derive equivalent finite-dimensional convex problems that are interpretable and solvable to global optimality.
We show how self-attention networks implicitly cluster the tokens based on their latent similarity.
arXiv Detail & Related papers (2022-05-17T04:01:15Z) - Reinforcement Learning from Partial Observation: Linear Function Approximation with Provable Sample Efficiency [111.83670279016599]
We study reinforcement learning for partially observable Markov decision processes (POMDPs) with infinite observation and state spaces.
We make the first attempt at partial observability and function approximation for a class of POMDPs with a linear structure.
arXiv Detail & Related papers (2022-04-20T21:15:38Z) - Joint Inference of Multiple Graphs from Matrix Polynomials [34.98220454543502]
Inferring graph structure from observations on the nodes is an important and popular network science task.
We study the problem of jointly inferring multiple graphs from the observation of signals at their nodes.
We propose a convex optimization method along with sufficient conditions that guarantee the recovery of the true graphs.
arXiv Detail & Related papers (2020-10-16T02:45:15Z)