FAVOR#: Sharp Attention Kernel Approximations via New Classes of
Positive Random Features
- URL: http://arxiv.org/abs/2302.00787v1
- Date: Wed, 1 Feb 2023 22:43:29 GMT
- Title: FAVOR#: Sharp Attention Kernel Approximations via New Classes of
Positive Random Features
- Authors: Valerii Likhosherstov, Krzysztof Choromanski, Avinava Dubey, Frederick
Liu, Tamas Sarlos, Adrian Weller
- Abstract summary: We propose parameterized, positive, non-trigonometric RFs which approximate Gaussian and softmax kernels.
We show that our methods lead to variance reduction in practice and outperform previous methods in a kernel regression task.
We also present FAVOR#, a method for self-attention approximation in Transformers.
- Score: 39.282051468586666
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The problem of efficient approximation of a linear operator induced by the
Gaussian or softmax kernel is often addressed using random features (RFs) which
yield an unbiased approximation of the operator's result. Such operators emerge
in important applications ranging from kernel methods to efficient
Transformers. We propose parameterized, positive, non-trigonometric RFs which
approximate Gaussian and softmax-kernels. In contrast to traditional RF
approximations, parameters of these new methods can be optimized to reduce the
variance of the approximation, and the optimum can be expressed in closed form.
We show that our methods lead to variance reduction in practice ($e^{10}$-times
smaller variance and beyond) and outperform previous methods in a kernel
regression task. Using our proposed mechanism, we also present FAVOR#, a method
for self-attention approximation in Transformers. We show that FAVOR#
outperforms other random feature methods in speech modelling and natural
language processing.
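For intuition, the snippet below sketches the classical positive random-feature estimator of the softmax kernel exp(q·k) that FAVOR-style mechanisms build on; FAVOR# further parameterizes such features and chooses the variance-minimizing parameters in closed form, which this sketch does not attempt. All names and settings here are illustrative, not the paper's implementation.

```python
import numpy as np

def positive_random_features(x, W):
    """Positive (non-trigonometric) random features for the softmax kernel.

    phi(x) = exp(W x - ||x||^2 / 2) / sqrt(m), so that
    E[phi(q) . phi(k)] = exp(q . k) when the rows of W are i.i.d. N(0, I).
    """
    m = W.shape[0]
    return np.exp(x @ W.T - 0.5 * np.sum(x**2, axis=-1, keepdims=True)) / np.sqrt(m)

rng = np.random.default_rng(0)
d, m = 16, 4096                      # feature dimension, number of random features
W = rng.standard_normal((m, d))      # plain i.i.d. Gaussian frequencies

q = rng.standard_normal((1, d)) / np.sqrt(d)
k = rng.standard_normal((1, d)) / np.sqrt(d)

exact = np.exp(q @ k.T)                                    # softmax kernel exp(q . k)
approx = positive_random_features(q, W) @ positive_random_features(k, W).T
print(float(exact), float(approx))                         # close for large m
```

Replacing exp(QK^T) in attention with phi(Q) phi(K)^T lets the matrix product be reassociated as phi(Q) (phi(K)^T V), which is how random-feature attention becomes linear in sequence length.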
Related papers
- Sample-efficient Bayesian Optimisation Using Known Invariances [56.34916328814857]
We show that vanilla and constrained BO algorithms are inefficient when optimising invariant objectives.
We derive a bound on the maximum information gain of these invariant kernels.
We use our method to design a current drive system for a nuclear fusion reactor, finding a high-performance solution.
arXiv Detail & Related papers (2024-10-22T12:51:46Z)
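For background (a standard construction, not necessarily the one used in this paper): when the objective is known to be invariant under a finite group of transformations, the kernel can be symmetrized by averaging over the group. A minimal sketch, assuming a simple sign-flip invariance:

```python
import numpy as np

def rbf(x, y, lengthscale=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * lengthscale**2))

def invariant_kernel(x, y, group, base_kernel=rbf):
    """Symmetrize a base kernel over a finite group of transformations.

    k_G(x, y) = (1/|G|^2) * sum_{g, h in G} k(g(x), h(y))
    is positive semi-definite and invariant in both arguments.
    """
    vals = [base_kernel(g(x), h(y)) for g in group for h in group]
    return sum(vals) / len(vals)

# Example: objective known to satisfy f(x) = f(-x) (sign-flip invariance).
group = [lambda z: z, lambda z: -z]

x = np.array([0.3, -1.2])
y = -x                                  # the reflection of x
print(invariant_kernel(x, y, group))    # equals the value below: the kernel
print(invariant_kernel(x, x, group))    # cannot distinguish x from its reflection
```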
- Variance-Reducing Couplings for Random Features [57.73648780299374]
Random features (RFs) are a popular technique to scale up kernel methods in machine learning.
We find couplings to improve RFs defined on both Euclidean and discrete input spaces.
We reach surprising conclusions about the benefits and limitations of variance reduction as a paradigm.
arXiv Detail & Related papers (2024-05-26T12:25:09Z)
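A classical instance of coupling random-feature frequencies, given here only as background rather than as the paper's construction, is orthogonal random features: Gaussian frequencies are forced to be exactly orthogonal within a block while keeping their marginal distribution, which typically lowers the variance of the kernel estimate. A minimal sketch:

```python
import numpy as np

def iid_frequencies(m, d, rng):
    """Baseline: i.i.d. Gaussian frequencies for Bochner random features."""
    return rng.standard_normal((m, d))

def orthogonal_frequencies(m, d, rng):
    """Coupled frequencies: orthogonal directions with chi-distributed norms.

    Each row is still marginally N(0, I_d), but rows within a block are
    exactly orthogonal, which reduces the variance of the kernel estimate.
    """
    blocks, remaining = [], m
    while remaining > 0:
        g = rng.standard_normal((d, d))
        q, r = np.linalg.qr(g)
        q = q * np.sign(np.diag(r))                  # make the factor Haar-distributed
        norms = np.sqrt(rng.chisquare(d, size=d))    # restore Gaussian row norms
        blocks.append(q * norms[:, None])
        remaining -= d
    return np.vstack(blocks)[:m]

def rbf_estimate(x, y, W):
    """Monte Carlo estimate of the Gaussian kernel exp(-||x-y||^2 / 2)."""
    return np.mean(np.cos(W @ (x - y)))

rng = np.random.default_rng(0)
d, m = 8, 64
x, y = rng.standard_normal(d), rng.standard_normal(d)
exact = np.exp(-np.sum((x - y) ** 2) / 2)
print(exact,
      rbf_estimate(x, y, iid_frequencies(m, d, rng)),
      rbf_estimate(x, y, orthogonal_frequencies(m, d, rng)))
```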
- Scalable Variational Gaussian Processes via Harmonic Kernel Decomposition [54.07797071198249]
We introduce a new scalable variational Gaussian process approximation which provides a high fidelity approximation while retaining general applicability.
We demonstrate that, on a range of regression and classification problems, our approach can exploit input space symmetries such as translations and reflections.
Notably, our approach achieves state-of-the-art results on CIFAR-10 among pure GP models.
arXiv Detail & Related papers (2021-06-10T18:17:57Z)
- Towards Unbiased Random Features with Lower Variance For Stationary Indefinite Kernels [26.57122949130266]
Our algorithm achieves lower variance and approximation error compared with the existing kernel approximation methods.
With a better approximation to the originally selected kernels, improved classification accuracy and regression ability are obtained.
arXiv Detail & Related papers (2021-04-13T13:56:50Z)
- Global Optimization of Gaussian processes [52.77024349608834]
We propose a reduced-space formulation with Gaussian processes trained on few data points.
The approach also leads to a significantly smaller and computationally cheaper sub-solver for lower bounding.
In total, the proposed method reduces convergence time by orders of magnitude.
arXiv Detail & Related papers (2020-05-21T20:59:11Z)
- SLEIPNIR: Deterministic and Provably Accurate Feature Expansion for Gaussian Process Regression with Derivatives [86.01677297601624]
We propose a novel approach for scaling GP regression with derivatives based on quadrature Fourier features.
We prove deterministic, non-asymptotic and exponentially fast decaying error bounds which apply for both the approximated kernel as well as the approximated posterior.
arXiv Detail & Related papers (2020-03-05T14:33:20Z)
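As background on quadrature Fourier features generally (a one-dimensional illustration, not the SLEIPNIR algorithm itself): the random frequencies of Monte Carlo Fourier features are replaced with deterministic Gauss-Hermite nodes and weights, which is what makes deterministic, exponentially decaying error bounds possible. A minimal sketch:

```python
import numpy as np

def gauss_hermite_fourier_features(x, n_nodes=32):
    """Deterministic Fourier features for the 1-D Gaussian kernel exp(-(x-y)^2 / 2).

    By Bochner's theorem, exp(-(x-y)^2 / 2) = E_{w~N(0,1)}[cos(w (x - y))].
    Gauss-Hermite quadrature approximates this expectation with weighted
    deterministic nodes instead of random samples.
    """
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    w = np.sqrt(2.0) * nodes                   # change of variables to N(0, 1)
    p = weights / np.sqrt(np.pi)               # normalized quadrature weights
    # Feature map: phi(x) = [sqrt(p_i) cos(w_i x), sqrt(p_i) sin(w_i x)]_i
    return np.concatenate([np.sqrt(p) * np.cos(np.outer(x, w)),
                           np.sqrt(p) * np.sin(np.outer(x, w))], axis=-1)

x = np.array([0.0, 0.7, 2.0])
phi = gauss_hermite_fourier_features(x)
approx = phi @ phi.T                            # approximates the kernel matrix
exact = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)
print(np.max(np.abs(approx - exact)))           # tiny: error decays rapidly in n_nodes
```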
- Certified and fast computations with shallow covariance kernels [0.0]
We introduce and analyze a new and certified algorithm for the low-rank approximation of a parameterized family of covariance operators.
The proposed algorithm provides the basis of a fast sampling procedure for parameter dependent random fields.
arXiv Detail & Related papers (2020-01-24T20:28:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.