NaLaFormer: Norm-Aware Linear Attention for Transformer Models
- URL: http://arxiv.org/abs/2506.21137v1
- Date: Thu, 26 Jun 2025 10:47:39 GMT
- Title: NaLaFormer: Norm-Aware Linear Attention for Transformer Models
- Authors: Weikang Meng, Yadan Luo, Liangyu Huo, Yaowei Wang, Xin Li, Zheng Zhang
- Abstract summary: We propose a novel Norm-Aware Linear Attention mechanism to restore norm-guided dynamic spikiness and recover kernel-perturbed norm distributions. We conduct extensive experiments demonstrating that the NaLaFormer improves performance on vision and language tasks, enhancing both expressiveness and efficiency by up to 4.2%.
- Score: 39.97155378043193
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Linear attention has emerged as a viable alternative to softmax attention by reducing complexity from quadratic to linear in sequence length. To preserve two fundamental properties of softmax, non-negativity and entropy reduction, current works employ various linearly separable kernel functions with $L_1$ normalization instead of the softmax operator. However, query norms are neglected by the normalization operation in linear attention, and this degradation leads to a substantial entropy gap. Meanwhile, existing works inhibit negative values of query and key vectors, resulting in missing inner-product interactions after mapping. To address these dual challenges, we propose a novel Norm-Aware Linear Attention mechanism serving to restore norm-guided dynamic spikiness and recover kernel-perturbed norm distributions. Specifically, we first decouple query and key matrices into two components, norm and direction, to achieve norm-aware spikiness control and norm consistency, respectively. We mathematically reveal that the extent of entropy reduction varies with the query norm in softmax normalization, motivating a query-norm aware kernel function for dynamic control over entropy reduction. Furthermore, to ensure norm consistency and enforce non-negativity constraints, we employ a norm-preserving mapping to project all elements of the angular matrix into positive values, leveraging cosine similarity to inhibit dimensions with opposite directions. We conduct extensive experiments demonstrating that the NaLaFormer improves performance on vision and language tasks, enhancing both expressiveness and efficiency by up to 4.2\%.
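The abstract's point that $L_1$ normalization discards the query norm can be illustrated with a small NumPy sketch. This is a generic kernelized linear attention, not the NaLaFormer kernel: the ReLU feature map is an assumption chosen because it is positively homogeneous, and all function names are hypothetical.

```python
import numpy as np


def relu(x):
    """Non-negative, positively homogeneous feature map (an assumed choice)."""
    return np.maximum(x, 0.0)


def softmax_attention_weights(q, K):
    """Softmax attention weights of one query q (d,) over keys K (n, d)."""
    scores = K @ q / np.sqrt(q.shape[0])
    scores = scores - scores.max()  # subtract the max for numerical stability
    w = np.exp(scores)
    return w / w.sum()


def linear_attention_weights(q, K, phi=relu):
    """Kernel similarities phi(q) . phi(k_j), L1-normalized over the keys."""
    sims = phi(K) @ phi(q)
    return sims / (sims.sum() + 1e-12)


def entropy(p):
    """Shannon entropy of a discrete distribution (zero entries are skipped)."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())


def linear_attention(Q, K, V, phi=relu):
    """Linear-time form: phi(Q) (phi(K)^T V), divided by phi(Q) (phi(K)^T 1)."""
    Qf, Kf = phi(Q), phi(K)
    kv = Kf.T @ V        # (d, d_v) summary of keys and values
    z = Kf.sum(axis=0)   # (d,) summary of keys used by the normalizer
    return (Qf @ kv) / (Qf @ z + 1e-12)[:, None]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K = rng.normal(size=(16, 8))
    q = rng.normal(size=8)
    for scale in (1.0, 4.0):
        h_soft = entropy(softmax_attention_weights(scale * q, K))
        h_lin = entropy(linear_attention_weights(scale * q, K))
        print(f"scale={scale}: softmax entropy={h_soft:.4f}, linear entropy={h_lin:.4f}")
```

Under these assumptions, scaling the query makes the softmax weights spikier (lower entropy), while the linear-attention weights stay the same because the scale factor cancels between numerator and normalizer. The `linear_attention` function never forms the n-by-n attention matrix, which is where the linear complexity in sequence length comes from.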
Related papers
- PolaFormer: Polarity-aware Linear Attention for Vision Transformers [16.35834984488344]
Linear attention has emerged as a promising alternative to softmax-based attention. We propose a polarity-aware linear attention mechanism that explicitly models both same-signed and opposite-signed query-key interactions. For simplicity, and recognizing the distinct contributions of each dimension, we employ a learnable power function for rescaling.
arXiv Detail & Related papers (2025-01-25T03:46:35Z) - Benign overfitting in Fixed Dimension via Physics-Informed Learning with Smooth Inductive Bias [8.668428992331808]
We develop a Sobolev norm learning curve for kernel ridge(less) regression when addressing (elliptical) linear inverse problems. Our results show that the PDE operators in the inverse problem can stabilize the variance and even exhibit benign overfitting for fixed-dimensional problems.
arXiv Detail & Related papers (2024-06-13T14:54:30Z) - Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators.
arXiv Detail & Related papers (2023-10-20T12:45:12Z) - The Inductive Bias of Flatness Regularization for Deep Matrix Factorization [58.851514333119255]
This work takes the first step toward understanding the inductive bias of the minimum trace of the Hessian solutions in deep linear networks.
We show that for all depths greater than one, with the standard Restricted Isometry Property (RIP) on the measurements, minimizing the trace of Hessian is approximately equivalent to minimizing the Schatten 1-norm of the corresponding end-to-end matrix parameters.
arXiv Detail & Related papers (2023-06-22T23:14:57Z) - Penalising the biases in norm regularisation enforces sparsity [24.43739371803548]
This work shows the parameters' norm required to represent a function is given by the total variation of its second derivative, weighted by a $\sqrt{1+x^2}$ factor. Notably, this weighting factor disappears when the norm of bias terms is not regularised.
arXiv Detail & Related papers (2023-03-02T15:33:18Z) - Benign Overfitting of Constant-Stepsize SGD for Linear Regression [122.70478935214128]
Inductive biases are central in preventing overfitting empirically.
This work considers this issue in arguably the most basic setting: constant-stepsize SGD for linear regression.
We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD in comparison to ordinary least squares.
arXiv Detail & Related papers (2021-03-23T17:15:53Z) - Understanding Implicit Regularization in Over-Parameterized Single Index Model [55.41685740015095]
We design regularization-free algorithms for the high-dimensional single index model.
We provide theoretical guarantees for the induced implicit regularization phenomenon.
arXiv Detail & Related papers (2020-07-16T13:27:47Z) - To Each Optimizer a Norm, To Each Norm its Generalization [31.682969645989512]
We study the implicit regularization of optimization methods for linear models interpolating the training data in the under-parametrized and over-parametrized regimes.
We argue that analyzing convergence to the standard maximum l2-margin is arbitrary and show that minimizing the norm induced by the data results in better generalizations.
arXiv Detail & Related papers (2020-06-11T21:07:38Z) - Multi-Objective Matrix Normalization for Fine-grained Visual Recognition [153.49014114484424]
Bilinear pooling achieves great success in fine-grained visual recognition (FGVC).
Recent methods have shown that the matrix power normalization can stabilize the second-order information in bilinear features.
We propose an efficient Multi-Objective Matrix Normalization (MOMN) method that can simultaneously normalize a bilinear representation.
arXiv Detail & Related papers (2020-03-30T08:40:35Z)