Primal-Attention: Self-attention through Asymmetric Kernel SVD in Primal
Representation
- URL: http://arxiv.org/abs/2305.19798v2
- Date: Tue, 5 Dec 2023 09:26:05 GMT
- Title: Primal-Attention: Self-attention through Asymmetric Kernel SVD in Primal
Representation
- Authors: Yingyi Chen, Qinghua Tao, Francesco Tonin, Johan A.K. Suykens
- Abstract summary: We provide a new perspective to represent and optimize self-attention through asymmetric Kernel Singular Value Decomposition (KSVD).
We show that KSVD optimization can be implemented by simply minimizing a regularization loss, so that the low-rank property is promoted without an extra decomposition.
This is the first work that provides a primal-dual representation for the asymmetric kernel in self-attention and successfully applies it to modeling and optimization.
- Score: 21.87428356353377
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recently, a new line of works has emerged to understand and improve
self-attention in Transformers by treating it as a kernel machine. However,
existing works apply the methods for symmetric kernels to the asymmetric
self-attention, resulting in a nontrivial gap between the analytical
understanding and numerical implementation. In this paper, we provide a new
perspective to represent and optimize self-attention through asymmetric Kernel
Singular Value Decomposition (KSVD), which is also motivated by the low-rank
property of self-attention normally observed in deep layers. Through asymmetric
KSVD, $i$) a primal-dual representation of self-attention is formulated, where
the optimization objective is cast to maximize the projection variances in the
attention outputs; $ii$) a novel attention mechanism, i.e., Primal-Attention,
is proposed via the primal representation of KSVD, avoiding explicit
computation of the kernel matrix in the dual; $iii$) with KKT conditions, we
prove that the stationary solution to the KSVD optimization in Primal-Attention
yields a zero-value objective. In this manner, KSVD optimization can be
implemented by simply minimizing a regularization loss, so that the low-rank
property is promoted without an extra decomposition. Numerical experiments show
state-of-the-art performance of our Primal-Attention with improved efficiency.
Moreover, we demonstrate that the deployed KSVD optimization regularizes
Primal-Attention with a sharper singular value decay than that of the canonical
self-attention, further verifying the great potential of our method. To the
best of our knowledge, this is the first work that provides a primal-dual
representation for the asymmetric kernel in self-attention and successfully
applies it to modeling and optimization.
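The following is a rough sketch of a Primal-Attention-style layer, written for illustration only and not the authors' implementation. Assumptions: the query/key projections followed by normalization play the role of the two feature maps, W_e and W_r are the learnable KSVD projection weights, Lambda is a learnable diagonal, and the regularizer shown is one plausible form of the KSVD objective that is simply driven towards its zero-valued stationary point.
```python
# Hedged sketch of a Primal-Attention-style layer (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class PrimalAttentionSketch(nn.Module):
    def __init__(self, d_model: int, s: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        # Primal projection weights for the two sets of KSVD scores.
        self.W_e = nn.Parameter(torch.randn(d_model, s) / d_model ** 0.5)
        self.W_r = nn.Parameter(torch.randn(d_model, s) / d_model ** 0.5)
        self.log_lmbda = nn.Parameter(torch.zeros(s))  # diagonal of Lambda
        self.out_proj = nn.Linear(2 * s, d_model)

    def forward(self, x):  # x: (batch, seq, d_model)
        phi_q = F.normalize(self.q_proj(x), dim=-1)  # feature map on queries
        phi_k = F.normalize(self.k_proj(x), dim=-1)  # feature map on keys
        # Primal scores: no (seq x seq) kernel/attention matrix is formed.
        e = phi_q @ self.W_e                         # (batch, seq, s)
        r = phi_k @ self.W_r                         # (batch, seq, s)
        out = self.out_proj(torch.cat([e, r], dim=-1))
        # Assumed KSVD regularizer: coupling term Tr(W_e^T Phi_q^T Phi_k W_r)
        # minus Lambda-weighted norms of the projection weights; its absolute
        # value is pushed towards zero during training.
        lmbda = torch.exp(self.log_lmbda)
        coupling = (e * r).sum(dim=(-2, -1)).mean()
        penalty = 0.5 * (lmbda * (self.W_e.pow(2).sum(0)
                                  + self.W_r.pow(2).sum(0))).sum()
        return out, (coupling - penalty).abs()


if __name__ == "__main__":
    layer = PrimalAttentionSketch(d_model=64, s=16)
    out, reg = layer(torch.randn(2, 128, 64))
    loss = out.pow(2).mean() + 0.1 * reg  # placeholder task loss + regularizer
    loss.backward()
```
In this sketch the regularizer is added to the task loss with a small weight, so the low-rank-promoting KSVD objective is optimized without any explicit decomposition step.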
Related papers
- Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization [77.62516752323207]
We introduce an orthogonal fine-tuning method for efficiently fine-tuning pretrained weights and enabling enhanced robustness and generalization.
A self-regularization strategy is further exploited to maintain stability in terms of the zero-shot generalization of VLMs; the overall method is dubbed OrthSR.
For the first time, we revisit CLIP and CoOp with our method to effectively improve the model in few-shot image classification scenarios.
arXiv Detail & Related papers (2024-07-11T10:35:53Z) - Learning in Feature Spaces via Coupled Covariances: Asymmetric Kernel SVD and Nyström method [21.16129116282759]
We introduce a new asymmetric learning paradigm based on the coupled covariance eigenproblem (CCE).
We formalize the asymmetric Nyström method through a finite-sample approximation to speed up training.
arXiv Detail & Related papers (2024-06-13T02:12:18Z) - Implicit Bias and Fast Convergence Rates for Self-attention [30.08303212679308]
Self-attention, the core mechanism of transformers, distinguishes them from traditional neural networks and drives their outstanding performance.
We investigate the implicit bias of gradient descent (GD) in training a self-attention layer with a fixed linear decoder in binary classification.
We provide the first finite-time convergence rate for $W_t$ to $W_{mm}$, along with the rate of sparsification in the attention map.
arXiv Detail & Related papers (2024-02-08T15:15:09Z) - Self-Attention through Kernel-Eigen Pair Sparse Variational Gaussian Processes [20.023544206079304]
We propose Kernel-Eigen Pair Sparse Variational Gaussian Processes (KEP-SVGP) for building uncertainty-aware self-attention.
Experiments verify the excellent performance and efficiency of our method on in-distribution, distribution-shift, and out-of-distribution benchmarks.
arXiv Detail & Related papers (2024-02-02T15:05:13Z) - Algorithmic Regularization in Tensor Optimization: Towards a Lifted
Approach in Matrix Sensing [28.295987262533075]
Gradient descent (GD) is crucial for generalization in machine learning models.
We show that GD induces implicit regularization, promoting compact representations.
Our findings underscore the significance of the tensor parametrization of matrix sensing.
arXiv Detail & Related papers (2023-10-24T06:40:26Z) - Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators.
arXiv Detail & Related papers (2023-10-20T12:45:12Z) - Transformers as Support Vector Machines [54.642793677472724]
We establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem.
We characterize the implicit bias of 1-layer transformers optimized with gradient descent.
We believe these findings inspire the interpretation of transformers as a hierarchy of SVMs that separates and selects optimal tokens.
arXiv Detail & Related papers (2023-08-31T17:57:50Z) - Unraveling Attention via Convex Duality: Analysis and Interpretations of
Vision Transformers [52.468311268601056]
This paper analyzes attention through the lens of convex duality.
We derive equivalent finite-dimensional convex problems that are interpretable and solvable to global optimality.
We show how self-attention networks implicitly cluster the tokens, based on their latent similarity.
arXiv Detail & Related papers (2022-05-17T04:01:15Z) - On the Efficient Implementation of the Matrix Exponentiated Gradient
Algorithm for Low-Rank Matrix Optimization [26.858608065417663]
Convex optimization over the spectrahedron has important applications in machine learning, signal processing and statistics.
We propose efficient implementations of MEG, which are tailored for optimization with low-rank matrices, and only use a single low-rank SVD on each iteration.
We also provide efficiently-computable certificates for the correct convergence of our methods.
arXiv Detail & Related papers (2020-12-18T19:14:51Z) - Understanding Implicit Regularization in Over-Parameterized Single Index
Model [55.41685740015095]
We design regularization-free algorithms for the high-dimensional single index model.
We provide theoretical guarantees for the induced implicit regularization phenomenon.
arXiv Detail & Related papers (2020-07-16T13:27:47Z) - Controllable Orthogonalization in Training DNNs [96.1365404059924]
Orthogonality is widely used for training deep neural networks (DNNs) due to its ability to maintain all singular values of the Jacobian close to 1.
This paper proposes a computationally efficient and numerically stable orthogonalization method using Newton's iteration (ONI).
We show that our method improves the performance of image classification networks by effectively controlling the orthogonality to provide an optimal tradeoff between optimization benefits and representational capacity reduction.
We also show that ONI stabilizes the training of generative adversarial networks (GANs) by maintaining the Lipschitz continuity of a network, similar to spectral normalization.
arXiv Detail & Related papers (2020-04-02T10:14:27Z)
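To make the entry above concrete, here is a rough sketch of orthogonalization by Newton's iteration in the spirit of ONI-style methods; it uses the classical Newton-Schulz scheme for the orthogonal polar factor and is an illustration under that assumption, not the paper's exact algorithm.
```python
# Hedged sketch: orthogonalizing a weight matrix with the Newton-Schulz
# iteration, the classical Newton-type scheme underlying ONI-style methods.
import torch


def newton_schulz_orthogonalize(W: torch.Tensor, n_iters: int = 15) -> torch.Tensor:
    """Approximate the orthogonal polar factor of W (more rows than columns)."""
    # Scale so every singular value lies in (0, 1], which ensures convergence
    # for full-column-rank W.
    Y = W / W.norm()
    I = torch.eye(W.shape[1], dtype=W.dtype, device=W.device)
    for _ in range(n_iters):
        Y = 0.5 * Y @ (3.0 * I - Y.t() @ Y)  # Newton-Schulz step
    return Y


if __name__ == "__main__":
    W = torch.randn(128, 64)
    Q = newton_schulz_orthogonalize(W)
    # Deviation from orthogonality is small after enough iterations.
    print((Q.t() @ Q - torch.eye(64)).abs().max())
```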