Implicit Regularization of Gradient Flow on One-Layer Softmax Attention
- URL: http://arxiv.org/abs/2403.08699v1
- Date: Wed, 13 Mar 2024 17:02:27 GMT
- Title: Implicit Regularization of Gradient Flow on One-Layer Softmax Attention
- Authors: Heejune Sheen, Siyu Chen, Tianhao Wang, Harrison H. Zhou
- Abstract summary: We study gradient flow on the exponential loss for a classification problem with a one-layer softmax attention model.
Under a separability assumption on the data, we show that when gradient flow achieves the minimal loss value, it further implicitly minimizes the nuclear norm of the product of the key and query weight matrices.
- Score: 10.060496091806694
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study gradient flow on the exponential loss for a classification problem
with a one-layer softmax attention model, where the key and query weight
matrices are trained separately. Under a separability assumption on the data,
we show that when gradient flow achieves the minimal loss value, it further
implicitly minimizes the nuclear norm of the product of the key and query
weight matrices. Such implicit regularization can be described by a Support
Vector Machine (SVM) problem with respect to the attention weights. This
finding contrasts with prior results showing that gradient descent induces
an implicit regularization on the Frobenius norm of the product weight matrix
when the key and query matrices are combined into a single weight matrix for
training. For diagonal key and query matrices, our analysis builds upon the
reparameterization technique and exploits approximate KKT conditions of the SVM
associated with the classification data. Moreover, the results are extended to
general weight configurations given proper alignment of the weight matrices'
singular spaces with the data features at initialization.
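As a schematic illustration of the setting (the notation below is assumed for exposition and is not quoted from the paper), a one-layer softmax attention classifier with key matrix $W_K$ and query matrix $W_Q$ can be written as
\[
f(X; W_K, W_Q) \;=\; u^\top X^\top \operatorname{softmax}\!\big(X W_K W_Q^\top z\big),
\]
where the rows of $X$ are the input tokens, $z$ is a query token, and $u$ is a fixed linear head, so that only $W_K$ and $W_Q$ are trained by gradient flow on the exponential loss $\exp(-y\, f)$. In this notation, the abstract's statement reads: once the loss is driven to its infimum, the product $W_K W_Q^\top$ is implicitly biased toward a solution of a nuclear-norm attention SVM of the form
\[
\min_{W}\ \|W\|_{*} \quad \text{subject to the margin constraints induced by the (separable) training data},
\]
where $\|\cdot\|_{*}$ denotes the nuclear norm, whereas training a single combined attention matrix $W$ yields the Frobenius-norm objective $\|W\|_F$ under the same constraints.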
Related papers
- Entrywise error bounds for low-rank approximations of kernel matrices [55.524284152242096]
We derive entrywise error bounds for low-rank approximations of kernel matrices obtained using the truncated eigen-decomposition.
A key technical innovation is a delocalisation result for the eigenvectors of the kernel matrix corresponding to small eigenvalues.
We validate our theory with an empirical study of a collection of synthetic and real-world datasets.
arXiv Detail & Related papers (2024-05-23T12:26:25Z)
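For the kernel entry above, a minimal illustration of the object being bounded (the data, kernel, and rank below are placeholders, not taken from the paper): the kernel matrix is replaced by its truncated eigendecomposition and the entrywise error of the resulting low-rank approximation is inspected.

```python
import numpy as np

# Toy Gaussian (RBF) kernel matrix on random one-dimensional inputs.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
K = np.exp(-0.5 * (x - x.T) ** 2)      # K[i, j] = exp(-(x_i - x_j)^2 / 2)

# Rank-r approximation from the truncated eigendecomposition.
r = 10
eigvals, eigvecs = np.linalg.eigh(K)   # eigenvalues in ascending order
U, S = eigvecs[:, -r:], eigvals[-r:]   # keep the top-r eigenpairs
K_r = U @ np.diag(S) @ U.T

# Entrywise (max-abs) error, the quantity that entrywise bounds control.
print("max entrywise error:", np.max(np.abs(K - K_r)))
```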
- Asymmetric matrix sensing by gradient descent with small random initialization [0.8611782340880084]
We study the problem of reconstructing a low-rank matrix from a few linear measurements.
Our key contribution is introducing a continuous gradient flow equation.
arXiv Detail & Related papers (2023-09-04T20:23:35Z)
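For the matrix-sensing entry above, the standard gradient flow for asymmetric low-rank factorization reads as follows; this is a generic sketch of the setup, and the specific flow equation introduced in the paper is not reproduced here.
\[
f(U, V) \;=\; \tfrac{1}{2}\,\big\|\mathcal{A}(U V^\top) - y\big\|_2^2,
\qquad
\dot U(t) = -\nabla_U f\big(U(t), V(t)\big),
\quad
\dot V(t) = -\nabla_V f\big(U(t), V(t)\big),
\]
with a small random initialization $U(0), V(0)$, where $\mathcal{A}$ is the linear measurement operator and $y = \mathcal{A}(M^\star)$ collects the few linear measurements of the unknown low-rank matrix $M^\star$.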
- Large-scale gradient-based training of Mixtures of Factor Analyzers [67.21722742907981]
This article contributes both a theoretical analysis and a new method for efficient high-dimensional training by gradient descent.
We prove that MFA training and inference/sampling can be performed based on precision matrices, which does not require matrix inversions after training is completed.
Beyond the theoretical analysis, we apply MFA to typical image datasets such as SVHN and MNIST, and demonstrate the ability to perform sample generation and outlier detection.
arXiv Detail & Related papers (2023-08-26T06:12:33Z)
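One standard way to realize the precision-matrix computation mentioned above is the Woodbury identity, which only ever solves a small system in the number of factors. The sketch below is illustrative; the parameter names (`W` for the factor loadings, `psi` for the diagonal noise variances) are assumptions rather than the paper's notation.

```python
import numpy as np

def mfa_component_precision(W, psi):
    """Precision matrix (W W^T + diag(psi))^{-1} of one MFA component,
    computed with the Woodbury identity so that only an l x l system
    (l = number of factors) is solved, never a D x D inversion."""
    D, l = W.shape
    psi_inv = 1.0 / psi                      # diag(psi)^{-1}, elementwise
    A = np.eye(l) + (W.T * psi_inv) @ W      # l x l capacitance matrix
    B = np.linalg.solve(A, W.T * psi_inv)    # A^{-1} W^T diag(psi)^{-1}
    return np.diag(psi_inv) - (psi_inv[:, None] * W) @ B

# Sanity check against direct inversion on toy sizes.
rng = np.random.default_rng(0)
D, l = 8, 3
W = rng.normal(size=(D, l))
psi = rng.uniform(0.5, 1.5, size=D)
assert np.allclose(mfa_component_precision(W, psi),
                   np.linalg.inv(W @ W.T + np.diag(psi)))
```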
- The Inductive Bias of Flatness Regularization for Deep Matrix Factorization [58.851514333119255]
This work takes the first step toward understanding the inductive bias of the minimum trace of the Hessian solutions in deep linear networks.
We show that for any depth greater than one, under the standard Restricted Isometry Property (RIP) on the measurements, minimizing the trace of the Hessian is approximately equivalent to minimizing the Schatten 1-norm of the corresponding end-to-end matrix parameters.
arXiv Detail & Related papers (2023-06-22T23:14:57Z)
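Restating the claim in the entry above schematically (the notation is assumed, not quoted): for a depth-$L$ linear network with layer matrices $W_1, \dots, W_L$ and a measurement loss $\mathcal{L}$ whose measurements satisfy RIP,
\[
\operatorname*{arg\,min}_{\theta:\ \mathcal{L}(\theta)=0} \operatorname{tr}\!\big(\nabla^2_\theta \mathcal{L}(\theta)\big)
\;\;\approx\;\;
\operatorname*{arg\,min}_{\theta:\ \mathcal{L}(\theta)=0} \big\|W_L W_{L-1} \cdots W_1\big\|_{S_1},
\qquad
\|M\|_{S_1} := \sum_i \sigma_i(M),
\]
i.e. minimizing the Hessian trace among interpolating solutions acts like nuclear-norm (Schatten 1-norm) minimization of the end-to-end matrix, up to an approximation error controlled by the RIP constant.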
- Sufficient dimension reduction for feature matrices [3.04585143845864]
We propose a method called principal support matrix machine (PSMM) for matrix sufficient dimension reduction.
Our numerical analysis demonstrates that the PSMM outperforms existing methods and has strong interpretability in real data applications.
arXiv Detail & Related papers (2023-03-07T23:16:46Z)
- Graph Polynomial Convolution Models for Node Classification of Non-Homophilous Graphs [52.52570805621925]
We investigate efficient learning from higher-order graph convolutions and from the adjacency matrix directly for node classification.
We show that the resulting models lead to new graph convolution architectures with a residual scaling parameter.
We demonstrate that the proposed methods obtain improved accuracy for node classification on non-homophilous graphs.
arXiv Detail & Related papers (2022-09-12T04:46:55Z)
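For reference, a minimal order-$K$ polynomial graph convolution of the general kind referred to above; the normalization, filter order, and coefficients below are placeholder choices, not the paper's exact model.

```python
import numpy as np

def polynomial_graph_convolution(A, X, thetas):
    """Order-K polynomial filter sum_{k=0}^{K} thetas[k] * A_hat^k @ X,
    where A_hat is the symmetrically normalized adjacency with self-loops."""
    n = A.shape[0]
    A_hat = A + np.eye(n)                          # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_hat = d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]
    out, P = np.zeros_like(X, dtype=float), X.astype(float)
    for theta in thetas:                           # P holds A_hat^k @ X
        out += theta * P
        P = A_hat @ P
    return out

# Toy usage: 4-node path graph, 2-dimensional node features, order-2 filter.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.arange(8, dtype=float).reshape(4, 2)
print(polynomial_graph_convolution(A, X, thetas=[1.0, 0.5, 0.25]))
```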
- Semi-Supervised Subspace Clustering via Tensor Low-Rank Representation [64.49871502193477]
We propose a novel semi-supervised subspace clustering method, which is able to simultaneously augment the initial supervisory information and construct a discriminative affinity matrix.
Comprehensive experimental results on six commonly-used benchmark datasets demonstrate the superiority of our method over state-of-the-art methods.
arXiv Detail & Related papers (2022-05-21T01:47:17Z)
- Learning a Compressive Sensing Matrix with Structural Constraints via Maximum Mean Discrepancy Optimization [17.104994036477308]
We introduce a learning-based algorithm to obtain a measurement matrix for compressive sensing related recovery problems.
The recent success of such metrics (here, the maximum mean discrepancy) in neural-network-related applications motivates a machine learning solution to this problem.
arXiv Detail & Related papers (2021-10-14T08:35:54Z)
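For context on the metric named in the entry above, a minimal (biased) estimator of the squared maximum mean discrepancy between two sample sets under an RBF kernel; the kernel and bandwidth are placeholder choices, and how the paper uses MMD to learn the measurement matrix is not reproduced here.

```python
import numpy as np

def mmd2_rbf(X, Y, sigma=1.0):
    """Biased estimator of the squared maximum mean discrepancy
    between samples X ~ P and Y ~ Q under an RBF kernel."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 2))
Y = rng.normal(0.5, 1.0, size=(500, 2))   # mean-shifted distribution
print("MMD^2, same distribution:   ", mmd2_rbf(X, rng.normal(0.0, 1.0, size=(500, 2))))
print("MMD^2, shifted distribution:", mmd2_rbf(X, Y))
```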
- Understanding Implicit Regularization in Over-Parameterized Single Index Model [55.41685740015095]
We design regularization-free algorithms for the high-dimensional single index model.
We provide theoretical guarantees for the induced implicit regularization phenomenon.
arXiv Detail & Related papers (2020-07-16T13:27:47Z)
- Efficient MCMC Sampling for Bayesian Matrix Factorization by Breaking Posterior Symmetries [1.3858051019755282]
We propose a simple modification to the prior choice that provably breaks these posterior symmetries and maintains or improves accuracy.
We show that using non-zero linearly independent prior means significantly lowers the autocorrelation of MCMC samples, and can also lead to lower reconstruction errors.
arXiv Detail & Related papers (2020-06-08T00:25:48Z)
- Accelerating Ill-Conditioned Low-Rank Matrix Estimation via Scaled Gradient Descent [34.0533596121548]
Low-rank matrix estimation is a canonical problem that finds numerous applications in signal processing, machine learning and imaging science.
We show that ScaledGD achieves the best of both worlds: a convergence rate independent of the condition number of the low-rank matrix, at the low per-iteration cost of gradient descent.
Our analysis is also applicable to general loss functions that are restricted strongly convex and smooth over low-rank matrices.
arXiv Detail & Related papers (2020-05-18T17:17:16Z)
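As a sketch, the commonly cited form of the ScaledGD update for a factorization $M = L R^\top$ with loss $f$ is
\[
L_{t+1} \;=\; L_t - \eta\, \nabla_L f(L_t, R_t)\,\big(R_t^\top R_t\big)^{-1},
\qquad
R_{t+1} \;=\; R_t - \eta\, \nabla_R f(L_t, R_t)\,\big(L_t^\top L_t\big)^{-1},
\]
where the right-preconditioning by the inverse Gram matrix of the opposite factor is what makes the convergence rate insensitive to the condition number; the step size, initialization, and exact loss follow the paper and are not reproduced here.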
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.