Primal-Attention: Self-attention through Asymmetric Kernel SVD in Primal
Representation
- URL: http://arxiv.org/abs/2305.19798v2
- Date: Tue, 5 Dec 2023 09:26:05 GMT
- Title: Primal-Attention: Self-attention through Asymmetric Kernel SVD in Primal
Representation
- Authors: Yingyi Chen, Qinghua Tao, Francesco Tonin, Johan A.K. Suykens
- Abstract summary: We provide a new perspective to represent and optimize self-attention through asymmetric Kernel Singular Value Decomposition (KSVD).
We show that KSVD optimization can be implemented by simply minimizing a regularization loss, so that the low-rank property is promoted without an extra decomposition.
This is the first work that provides a primal-dual representation for the asymmetric kernel in self-attention and successfully applies it to modeling and optimization.
- Score: 21.87428356353377
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recently, a new line of works has emerged to understand and improve
self-attention in Transformers by treating it as a kernel machine. However,
existing works apply the methods for symmetric kernels to the asymmetric
self-attention, resulting in a nontrivial gap between the analytical
understanding and numerical implementation. In this paper, we provide a new
perspective to represent and optimize self-attention through asymmetric Kernel
Singular Value Decomposition (KSVD), which is also motivated by the low-rank
property of self-attention normally observed in deep layers. Through asymmetric
KSVD, $i$) a primal-dual representation of self-attention is formulated, where
the optimization objective is cast to maximize the projection variances in the
attention outputs; $ii$) a novel attention mechanism, i.e., Primal-Attention,
is proposed via the primal representation of KSVD, avoiding explicit
computation of the kernel matrix in the dual; $iii$) with KKT conditions, we
prove that the stationary solution to the KSVD optimization in Primal-Attention
yields a zero-value objective. In this manner, KSVD optimization can be
implemented by simply minimizing a regularization loss, so that the low-rank
property is promoted without an extra decomposition. Numerical experiments show
state-of-the-art performance of our Primal-Attention with improved efficiency.
Moreover, we demonstrate that the deployed KSVD optimization regularizes
Primal-Attention with a sharper singular value decay than that of the canonical
self-attention, further verifying the great potential of our method. To the
best of our knowledge, this is the first work that provides a primal-dual
representation for the asymmetric kernel in self-attention and successfully
applies it to modeling and optimization.
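The following is a rough sketch of a Primal-Attention-style layer, written for illustration only and not the authors' implementation. Assumptions: the query/key projections followed by normalization play the role of the two feature maps, W_e and W_r are the learnable KSVD projection weights, Lambda is a learnable diagonal, and the regularizer shown is one plausible form of the KSVD objective that is simply driven towards its zero-valued stationary point.
```python
# Hedged sketch of a Primal-Attention-style layer (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class PrimalAttentionSketch(nn.Module):
    def __init__(self, d_model: int, s: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        # Primal projection weights for the two sets of KSVD scores.
        self.W_e = nn.Parameter(torch.randn(d_model, s) / d_model ** 0.5)
        self.W_r = nn.Parameter(torch.randn(d_model, s) / d_model ** 0.5)
        self.log_lmbda = nn.Parameter(torch.zeros(s))  # diagonal of Lambda
        self.out_proj = nn.Linear(2 * s, d_model)

    def forward(self, x):  # x: (batch, seq, d_model)
        phi_q = F.normalize(self.q_proj(x), dim=-1)  # feature map on queries
        phi_k = F.normalize(self.k_proj(x), dim=-1)  # feature map on keys
        # Primal scores: no (seq x seq) kernel/attention matrix is formed.
        e = phi_q @ self.W_e                         # (batch, seq, s)
        r = phi_k @ self.W_r                         # (batch, seq, s)
        out = self.out_proj(torch.cat([e, r], dim=-1))
        # Assumed KSVD regularizer: coupling term Tr(W_e^T Phi_q^T Phi_k W_r)
        # minus Lambda-weighted norms of the projection weights; its absolute
        # value is pushed towards zero during training.
        lmbda = torch.exp(self.log_lmbda)
        coupling = (e * r).sum(dim=(-2, -1)).mean()
        penalty = 0.5 * (lmbda * (self.W_e.pow(2).sum(0)
                                  + self.W_r.pow(2).sum(0))).sum()
        return out, (coupling - penalty).abs()


if __name__ == "__main__":
    layer = PrimalAttentionSketch(d_model=64, s=16)
    out, reg = layer(torch.randn(2, 128, 64))
    loss = out.pow(2).mean() + 0.1 * reg  # placeholder task loss + regularizer
    loss.backward()
```
In this sketch the regularizer is added to the task loss with a small weight, so the low-rank-promoting KSVD objective is optimized without any explicit decomposition step.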
Related papers
- Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization [77.62516752323207]
We introduce an orthogonal fine-tuning method for efficiently fine-tuning pretrained weights and enabling enhanced robustness and generalization.
A self-regularization strategy is further exploited to maintain stability in terms of the zero-shot generalization of VLMs; the overall method is dubbed OrthSR.
For the first time, we revisit CLIP and CoOp with our method to effectively improve the model in few-shot image classification scenarios.
arXiv Detail & Related papers (2024-07-11T10:35:53Z) - Learning in Feature Spaces via Coupled Covariances: Asymmetric Kernel SVD and Nyström method [21.16129116282759]
We introduce a new asymmetric learning paradigm based on the coupled covariance eigenproblem (CCE).
We formalize the asymmetric Nyström method through a finite-sample approximation to speed up training.
arXiv Detail & Related papers (2024-06-13T02:12:18Z) - Implicit Bias and Fast Convergence Rates for Self-attention [30.08303212679308]
Self-attention, the core mechanism of transformers, distinguishes them from traditional neural networks and drives their outstanding performance.
We investigate the implicit bias of gradient descent (GD) in training a self-attention layer with a fixed linear decoder in binary classification.
We provide the first finite-time convergence rate for $W_t$ to $W_{mm}$, along with the rate of sparsification in the attention map.
arXiv Detail & Related papers (2024-02-08T15:15:09Z) - Self-Attention through Kernel-Eigen Pair Sparse Variational Gaussian Processes [20.023544206079304]
We propose Kernel-Eigen Pair Sparse Variational Gaussian Processes (KEP-SVGP) for building uncertainty-aware self-attention.
Experiments verify the excellent performance and efficiency of our method on in-distribution, distribution-shift, and out-of-distribution benchmarks.
arXiv Detail & Related papers (2024-02-02T15:05:13Z) - Algorithmic Regularization in Tensor Optimization: Towards a Lifted
Approach in Matrix Sensing [28.295987262533075]
Gradient descent (GD) is crucial for generalization in machine learning models.
We show that GD induces implicit regularization, promoting compact representations.
Our findings underscore the significance of the tensor parametrization of matrix sensing.
arXiv Detail & Related papers (2023-10-24T06:40:26Z) - Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators.
arXiv Detail & Related papers (2023-10-20T12:45:12Z) - Transformers as Support Vector Machines [54.642793677472724]
We establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem.
We characterize the implicit bias of 1-layer transformers optimized with gradient descent.
We believe these findings inspire the interpretation of transformers as a hierarchy of SVMs that separates and selects optimal tokens.
arXiv Detail & Related papers (2023-08-31T17:57:50Z) - Unraveling Attention via Convex Duality: Analysis and Interpretations of
Vision Transformers [52.468311268601056]
This paper analyzes attention through the lens of convex duality.
We derive equivalent finite-dimensional convex problems that are interpretable and solvable to global optimality.
We show how self-attention networks implicitly cluster the tokens, based on their latent similarity.
arXiv Detail & Related papers (2022-05-17T04:01:15Z) - On the Efficient Implementation of the Matrix Exponentiated Gradient
Algorithm for Low-Rank Matrix Optimization [26.858608065417663]
Convex optimization over the spectrahedron has important applications in machine learning, signal processing and statistics.
We propose efficient implementations of MEG, which are tailored for optimization with low-rank matrices, and only use a single low-rank SVD on each iteration.
We also provide efficiently-computable certificates for the correct convergence of our methods.
arXiv Detail & Related papers (2020-12-18T19:14:51Z) - Understanding Implicit Regularization in Over-Parameterized Single Index
Model [55.41685740015095]
We design regularization-free algorithms for the high-dimensional single index model.
We provide theoretical guarantees for the induced implicit regularization phenomenon.
arXiv Detail & Related papers (2020-07-16T13:27:47Z) - Controllable Orthogonalization in Training DNNs [96.1365404059924]
Orthogonality is widely used for training deep neural networks (DNNs) due to its ability to maintain all singular values of the Jacobian close to 1.
This paper proposes a computationally efficient and numerically stable orthogonalization method using Newton's iteration (ONI).
We show that our method improves the performance of image classification networks by effectively controlling the orthogonality to provide an optimal tradeoff between optimization benefits and representational capacity reduction.
We also show that ONI stabilizes the training of generative adversarial networks (GANs) by maintaining the Lipschitz continuity of a network, similar to spectral normalization.
arXiv Detail & Related papers (2020-04-02T10:14:27Z)
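To make the entry above concrete, here is a rough sketch of orthogonalization by Newton's iteration in the spirit of ONI-style methods; it uses the classical Newton-Schulz scheme for the orthogonal polar factor and is an illustration under that assumption, not the paper's exact algorithm.
```python
# Hedged sketch: orthogonalizing a weight matrix with the Newton-Schulz
# iteration, the classical Newton-type scheme underlying ONI-style methods.
import torch


def newton_schulz_orthogonalize(W: torch.Tensor, n_iters: int = 15) -> torch.Tensor:
    """Approximate the orthogonal polar factor of W (more rows than columns)."""
    # Scale so every singular value lies in (0, 1], which ensures convergence
    # for full-column-rank W.
    Y = W / W.norm()
    I = torch.eye(W.shape[1], dtype=W.dtype, device=W.device)
    for _ in range(n_iters):
        Y = 0.5 * Y @ (3.0 * I - Y.t() @ Y)  # Newton-Schulz step
    return Y


if __name__ == "__main__":
    W = torch.randn(128, 64)
    Q = newton_schulz_orthogonalize(W)
    # Deviation from orthogonality is small after enough iterations.
    print((Q.t() @ Q - torch.eye(64)).abs().max())
```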