Self-Attention through Kernel-Eigen Pair Sparse Variational Gaussian Processes
- URL: http://arxiv.org/abs/2402.01476v2
- Date: Tue, 28 May 2024 09:13:19 GMT
- Title: Self-Attention through Kernel-Eigen Pair Sparse Variational Gaussian Processes
- Authors: Yingyi Chen, Qinghua Tao, Francesco Tonin, Johan A. K. Suykens
- Abstract summary: We propose Kernel-Eigen Pair Sparse Variational Gaussian Processes (KEP-SVGP) for building uncertainty-aware self-attention.
Experiments verify the excellent performance and efficiency of our method on in-distribution, distribution-shift and out-of-distribution benchmarks.
- Score: 20.023544206079304
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: While the great capability of Transformers significantly boosts prediction accuracy, it can also yield overconfident predictions that call for calibrated uncertainty estimation, which is commonly tackled with Gaussian processes (GPs). Existing works apply GPs with symmetric kernels under variational inference to the attention kernel; however, they omit the fact that attention kernels are in essence asymmetric. Moreover, the complexity of deriving the GP posteriors remains high for large-scale data. In this work, we propose Kernel-Eigen Pair Sparse Variational Gaussian Processes (KEP-SVGP) for building uncertainty-aware self-attention, where the asymmetry of attention kernels is tackled by Kernel SVD (KSVD) and a reduced complexity is attained. Through KEP-SVGP, i) the SVGP pair induced by the two sets of singular vectors from KSVD w.r.t. the attention kernel fully characterizes the asymmetry; ii) using only a small set of adjoint eigenfunctions from KSVD, the derivation of the SVGP posteriors is based on the inversion of a diagonal matrix containing singular values, contributing to a reduction in time complexity; iii) an evidence lower bound is derived so that variational parameters and network weights can be optimized with it. Experiments verify the excellent performance and efficiency of our method on in-distribution, distribution-shift and out-of-distribution benchmarks.
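To make the mechanism above concrete, here is a minimal numpy sketch: it forms the asymmetric exponential attention kernel, takes a truncated SVD as a stand-in for KSVD, and builds a posterior-mean-style estimate from each set of singular vectors in which the only matrix ever inverted is the diagonal of singular values. The kernel choice, the averaged two-branch estimate, and all function names are illustrative assumptions, not the paper's exact KEP-SVGP posterior or ELBO.

```python
import numpy as np

def attention_kernel(Q, K):
    """Asymmetric attention kernel matrix: k(q_i, k_j) = exp(q_i . k_j / sqrt(d))."""
    d = Q.shape[1]
    return np.exp(Q @ K.T / np.sqrt(d))

def ksvd_pair(Kmat, s):
    """Truncated SVD of the asymmetric kernel matrix, Kmat ~ U diag(sv) V^T.
    The two sets of singular vectors U and V characterize the asymmetry."""
    U, sv, Vt = np.linalg.svd(Kmat, full_matrices=False)
    return U[:, :s], sv[:s], Vt[:s].T

def svgp_pair_mean(U, sv, V, y, noise=1e-2):
    """One sparse-GP-style branch per set of singular vectors; the only matrix
    inverted is the diagonal (sv + noise), which is the source of the reduced
    complexity compared with a dense n x n inversion."""
    mean_u = U @ (sv * ((U.T @ y) / (sv + noise)))
    mean_v = V @ (sv * ((V.T @ y) / (sv + noise)))
    return 0.5 * (mean_u + mean_v)  # merging the two branches is an illustrative choice

# toy usage
rng = np.random.default_rng(0)
Q, K = rng.normal(size=(32, 16)), rng.normal(size=(32, 16))
y = rng.normal(size=32)
Kmat = attention_kernel(Q, K)
U, sv, V = ksvd_pair(Kmat, s=8)
print(svgp_pair_mean(U, sv, V, y).shape)  # (32,)
```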
Related papers
- Learning in Feature Spaces via Coupled Covariances: Asymmetric Kernel SVD and Nyström method [21.16129116282759]
We introduce a new asymmetric learning paradigm based on the coupled covariance eigenproblem (CCE).
We formalize the asymmetric Nyström method through a finite sample approximation to speed up training.
arXiv Detail & Related papers (2024-06-13T02:12:18Z)
- Variance-Reducing Couplings for Random Features [57.73648780299374]
Random features (RFs) are a popular technique to scale up kernel methods in machine learning.
We find couplings to improve RFs defined on both Euclidean and discrete input spaces.
We reach surprising conclusions about the benefits and limitations of variance reduction as a paradigm.
arXiv Detail & Related papers (2024-05-26T12:25:09Z)
- Neural Tangent Kernels Motivate Graph Neural Networks with Cross-Covariance Graphs [94.44374472696272]
We investigate neural tangent kernels (NTKs) and alignment in the context of graph neural networks (GNNs).
Our results establish the theoretical guarantees on the optimality of the alignment for a two-layer GNN.
These guarantees are characterized by the graph shift operator being a function of the cross-covariance between the input and the output data.
arXiv Detail & Related papers (2023-10-16T19:54:21Z)
- Transformers as Support Vector Machines [54.642793677472724]
We establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem.
We characterize the implicit bias of 1-layer transformers optimized with gradient descent.
We believe these findings inspire the interpretation of transformers as a hierarchy of SVMs that separates and selects optimal tokens.
arXiv Detail & Related papers (2023-08-31T17:57:50Z)
- Primal-Attention: Self-attention through Asymmetric Kernel SVD in Primal Representation [21.87428356353377]
We provide a new perspective for representing and optimizing self-attention through asymmetric Kernel Singular Value Decomposition (KSVD).
We show that KSVD optimization can be implemented by simply minimizing a regularization loss, so that the low-rank property is promoted without an extra decomposition.
This is the first work that provides a primal-dual representation for the asymmetric kernel in self-attention and successfully applies it to modeling and optimization.
arXiv Detail & Related papers (2023-05-31T12:38:24Z)
- Towards Understanding the Dynamics of Gaussian-Stein Variational Gradient Descent [16.16051064618816]
Stein Variational Gradient Descent (SVGD) is a nonparametric particle-based deterministic sampling algorithm.
We study the dynamics of the Gaussian-SVGD projected to the family of Gaussian distributions via the bilinear kernel.
We propose a density-based and a particle-based implementation of the Gaussian-SVGD, and show that several recent algorithms for GVI, proposed from different perspectives, emerge as special cases of our unified framework.
arXiv Detail & Related papers (2023-05-23T13:55:47Z)
- A Note on Optimizing Distributions using Kernel Mean Embeddings [94.96262888797257]
Kernel mean embeddings represent probability measures by their infinite-dimensional mean embeddings in a reproducing kernel Hilbert space.
We show that when the kernel is characteristic, distributions with a kernel sum-of-squares density are dense.
We provide algorithms to optimize such distributions in the finite-sample setting.
arXiv Detail & Related papers (2021-06-18T08:33:45Z)
- Scalable Variational Gaussian Processes via Harmonic Kernel Decomposition [54.07797071198249]
We introduce a new scalable variational Gaussian process approximation which provides a high fidelity approximation while retaining general applicability.
We demonstrate that, on a range of regression and classification problems, our approach can exploit input space symmetries such as translations and reflections.
Notably, our approach achieves state-of-the-art results on CIFAR-10 among pure GP models.
arXiv Detail & Related papers (2021-06-10T18:17:57Z)
- High-Dimensional Gaussian Process Inference with Derivatives [90.8033626920884]
We show that in the low-data regime $N < D$, the Gram matrix can be decomposed in a manner that reduces the cost of inference to $\mathcal{O}(N^2D + (N^2)^3)$.
We demonstrate this potential in a variety of tasks relevant for machine learning, such as optimization and Hamiltonian Monte Carlo with predictive gradients.
arXiv Detail & Related papers (2021-02-15T13:24:41Z)
- tvGP-VAE: Tensor-variate Gaussian Process Prior Variational Autoencoder [0.0]
tvGP-VAE is able to explicitly model correlation via the use of kernel functions.
We show that the choice of which correlation structures to explicitly represent in the latent space has a significant impact on model performance.
arXiv Detail & Related papers (2020-06-08T17:59:13Z)
- Randomly Projected Additive Gaussian Processes for Regression [37.367935314532154]
We use additive sums of kernels for GP regression, where each kernel operates on a different random projection of its inputs.
We prove that this additive construction converges as the number of projections grows, characterize its rate, and propose a deterministic approach that converges more quickly than purely random projections.
arXiv Detail & Related papers (2019-12-30T07:26:18Z)
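The randomly projected additive construction in the last entry can be sketched directly: an additive sum of RBF kernels, each evaluated on a different random projection of the inputs, plugged into standard GP regression. The 1-D Gaussian projections, lengthscale, and function names below are illustrative assumptions rather than the paper's exact setup.

```python
import numpy as np

def rp_additive_kernel(X1, X2, projections, lengthscale=1.0):
    """Additive sum of RBF kernels, each acting on a different random projection
    of the inputs (a sketch of the construction described in the abstract)."""
    Ksum = np.zeros((X1.shape[0], X2.shape[0]))
    for P in projections:                       # P has shape (d, k); here k = 1
        Z1, Z2 = X1 @ P, X2 @ P
        sq = ((Z1[:, None, :] - Z2[None, :, :]) ** 2).sum(-1)
        Ksum += np.exp(-0.5 * sq / lengthscale**2)
    return Ksum / len(projections)

def gp_regression_mean(Xtrain, ytrain, Xtest, projections, noise=1e-2):
    """Standard GP posterior mean using the randomly projected additive kernel."""
    Ktt = rp_additive_kernel(Xtrain, Xtrain, projections)
    Kst = rp_additive_kernel(Xtest, Xtrain, projections)
    return Kst @ np.linalg.solve(Ktt + noise * np.eye(len(ytrain)), ytrain)

# toy usage: 5 random 1-D projections of 20-dimensional inputs
rng = np.random.default_rng(0)
d, J = 20, 5
projections = [rng.normal(size=(d, 1)) / np.sqrt(d) for _ in range(J)]
Xtr, ytr = rng.normal(size=(50, d)), rng.normal(size=50)
Xte = rng.normal(size=(10, d))
print(gp_regression_mean(Xtr, ytr, Xte, projections).shape)  # (10,)
```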