Skyformer: Remodel Self-Attention with Gaussian Kernel and Nyström Method
- URL: http://arxiv.org/abs/2111.00035v1
- Date: Fri, 29 Oct 2021 18:28:49 GMT
- Title: Skyformer: Remodel Self-Attention with Gaussian Kernel and Nyström Method
- Authors: Yifan Chen, Qi Zeng, Heng Ji, Yun Yang
- Abstract summary: We introduce Skyformer, which replaces the softmax structure with a Gaussian kernel to stabilize model training and adapts the Nyström method to accelerate the computation.
Experiments on the Long Range Arena benchmark show that the proposed method achieves performance comparable to or even better than full self-attention.
- Score: 35.62926659320816
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers are expensive to train due to the quadratic time and space
complexity in the self-attention mechanism. On the other hand, although kernel
machines suffer from the same computation bottleneck in pairwise dot products,
several approximation schemes have been successfully incorporated to
considerably reduce their computational cost without sacrificing too much
accuracy. In this work, we leverage the computation methods for kernel machines
to alleviate the high computational cost and introduce Skyformer, which
replaces the softmax structure with a Gaussian kernel to stabilize the model
training and adapts the Nyström method to a non-positive semidefinite matrix
to accelerate the computation. We further conduct theoretical analysis by
showing that the matrix approximation error of our proposed method is small in
the spectral norm. Experiments on Long Range Arena benchmark show that the
proposed method is sufficient in getting comparable or even better performance
than the full self-attention while requiring fewer computation resources.
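The abstract combines two ingredients: Gaussian-kernel scores in place of softmax, and a Nyström approximation for sub-quadratic cost. The sketch below is a minimal numpy illustration of that combination, not the authors' exact algorithm: landmarks are sampled uniformly from the stacked queries and keys, the standard Nyström cross-approximation kappa(Q, K) ≈ kappa(Q, Z) kappa(Z, Z)^+ kappa(Z, K) is applied to V right-to-left, Skyformer's normalization and its handling of the non-PSD case are omitted, and the bandwidth sigma and all function names are illustrative choices. The final print reports the relative spectral-norm error, the quantity the abstract's analysis concerns.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    # Pairwise ||x - y||^2 via the expansion ||x||^2 + ||y||^2 - 2 x.y,
    # clipped at zero to guard against negative round-off.
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def nystrom_kernel_attention(Q, K, V, m=64, sigma=1.0, seed=0):
    """Approximate gaussian_kernel(Q, K) @ V with O(n*m) kernel evaluations."""
    rng = np.random.default_rng(seed)
    P = np.concatenate([Q, K], axis=0)               # landmarks from both sides
    Z = P[rng.choice(len(P), size=m, replace=False)]
    C_q = gaussian_kernel(Q, Z, sigma)               # n x m
    W = gaussian_kernel(Z, Z, sigma)                 # m x m
    C_k = gaussian_kernel(Z, K, sigma)               # m x n
    # kappa(Q, K) ~ C_q @ pinv(W) @ C_k; evaluate right-to-left so the
    # full n x n attention matrix is never materialized.
    return C_q @ np.linalg.lstsq(W, C_k @ V, rcond=None)[0]

n, d = 512, 32
rng = np.random.default_rng(1)
Q, K, V = (0.3 * rng.standard_normal((n, d)) for _ in range(3))
exact = gaussian_kernel(Q, K) @ V
approx = nystrom_kernel_attention(Q, K, V, m=128)
print(np.linalg.norm(exact - approx, 2) / np.linalg.norm(exact, 2))
```

For fixed m, time and memory are linear in the sequence length n, and note that the softmax row normalization is absent entirely, matching the abstract's "replaces the softmax structure" rather than approximating it.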
Related papers
- Stochastic Optimization for Non-convex Problem with Inexact Hessian
Matrix, Gradient, and Function [99.31457740916815]
Trust-region (TR) methods and adaptive regularization using cubics (ARC) have proven to have very appealing theoretical properties.
We show that TR and ARC methods can simultaneously accommodate inexact computations of the Hessian, gradient, and function values.
arXiv Detail & Related papers (2023-10-18T10:29:58Z)
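As a toy illustration of the entry above, here is a textbook trust-region loop (Cauchy-point step, standard rho-test) driven by a row-subsampled, hence inexact, Hessian. The test problem, constants, and function names are made up for the example; nothing here reproduces the paper's ARC variant or its theory.

```python
import numpy as np

def cauchy_point(g, H, delta):
    # Closed-form minimizer of the quadratic TR model along -g within radius delta.
    gHg = g @ H @ g
    tau = 1.0 if gHg <= 0 else min(1.0, np.linalg.norm(g) ** 3 / (delta * gHg))
    return -(tau * delta / np.linalg.norm(g)) * g

def tr_minimize(f, grad, hess, x, delta=1.0, iters=100, eta=0.1):
    for _ in range(iters):
        g, H = grad(x), hess(x)                   # hess may return a noisy estimate
        if np.linalg.norm(g) < 1e-12:
            return x
        p = cauchy_point(g, H, delta)
        pred = -(g @ p + 0.5 * p @ H @ p)         # decrease predicted by the model
        rho = (f(x) - f(x + p)) / max(pred, 1e-12)
        if rho > eta:                             # accept only sufficiently good steps
            x = x + p
        delta = 2.0 * delta if rho > 0.75 else 0.5 * delta  # simplistic radius rule
    return x

# Least-squares test problem; the Hessian A^T A is estimated from subsampled rows.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((200, 10)), rng.standard_normal(200)
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)

def hess(x, m=50):
    idx = rng.choice(200, size=m, replace=False)  # inexact, subsampled Hessian
    return (200 / m) * A[idx].T @ A[idx]

print(f(np.zeros(10)), "->", f(tr_minimize(f, grad, hess, np.zeros(10))))
```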
- Reconstructing Kernel-based Machine Learning Force Fields with Super-linear Convergence [0.18416014644193063]
We consider the broad class of Nyström-type methods for constructing preconditioners.
All considered methods aim to identify a representative subset of inducing (kernel) columns to approximate the dominant kernel spectrum.
arXiv Detail & Related papers (2022-12-24T13:45:50Z)
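To make "inducing columns as a preconditioner" concrete, here is a hedged numpy sketch: it takes m columns of a kernel matrix, forms the Nyström approximation K_hat = C W^+ C^T, and applies (K_hat + lam*I)^{-1} via the Woodbury identity so it can serve as a preconditioner for an iterative solver. Column selection here is plain uniform sampling, whereas the paper studies more careful schemes; all names are illustrative.

```python
import numpy as np

def nystrom_preconditioner(K, idx, lam):
    """Return x -> (K_hat + lam*I)^{-1} x, where K_hat = C W^+ C^T is the
    Nystrom approximation built from the inducing columns K[:, idx]."""
    C = K[:, idx]                                        # n x m inducing columns
    W = C[idx, :]                                        # m x m block
    L = np.linalg.cholesky(W + 1e-8 * np.eye(len(idx)))  # jitter for stability
    U = np.linalg.solve(L, C.T).T                        # K_hat = U @ U.T
    M = lam * np.eye(U.shape[1]) + U.T @ U
    # Woodbury: (U U^T + lam*I)^{-1} = (I - U M^{-1} U^T) / lam, O(n*m) to apply
    return lambda x: (x - U @ np.linalg.solve(M, U.T @ x)) / lam

# Demo on a Gaussian kernel matrix: conditioning before and after.
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 5))
K = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1) / 2.0)
lam = 1e-2
P = nystrom_preconditioner(K, rng.choice(300, size=40, replace=False), lam)
A = K + lam * np.eye(300)
PA = np.stack([P(col) for col in A.T]).T                 # dense P @ A, inspection only
print(np.linalg.cond(A), "->", np.linalg.cond(PA))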
- Linear Self-Attention Approximation via Trainable Feedforward Kernel [77.34726150561087]
In pursuit of faster computation, Efficient Transformers demonstrate an impressive variety of approaches.
We aim to expand the idea of trainable kernel methods to approximate the self-attention mechanism of the Transformer architecture.
arXiv Detail & Related papers (2022-11-08T08:14:11Z)
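A common way to realize the idea in the entry above is "linearized" attention: pass queries and keys through a feature map phi and exploit associativity, phi(Q) (phi(K)^T V), so the cost grows linearly in sequence length. The sketch below uses a fixed random feedforward map with an elu+1 nonlinearity, a common choice in the linearized-attention literature; the specific map is an assumption for illustration, and in the cited work its weights would be trained.

```python
import numpy as np

def phi(X, W):
    """Feedforward feature map; elu(x) + 1 keeps features positive so the
    implicit attention weights stay nonnegative."""
    H = X @ W
    return np.where(H > 0, H + 1.0, np.exp(np.minimum(H, 0.0)))

def linear_attention(Q, K, V, W):
    Qf, Kf = phi(Q, W), phi(K, W)     # n x r features, r independent of n
    KV = Kf.T @ V                     # r x d summary of keys and values
    Z = Qf @ Kf.sum(axis=0)           # per-query normalizer (row sums)
    return (Qf @ KV) / Z[:, None]     # total cost O(n * r * d), linear in n

rng = np.random.default_rng(0)
n, d, r = 1024, 64, 128
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
W = rng.standard_normal((d, r)) / d ** 0.5   # fixed here; trainable in the paper
out = linear_attention(Q, K, V, W)
print(out.shape)
```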
- Faster One-Sample Stochastic Conditional Gradient Method for Composite Convex Minimization [61.26619639722804]
We propose a conditional gradient method (CGM) for minimizing convex finite-sum objectives formed as a sum of smooth and non-smooth terms.
The proposed method, equipped with a stochastic average gradient (SAG) estimator, requires only one sample per iteration, yet guarantees fast convergence rates on par with more sophisticated variance reduction techniques.
arXiv Detail & Related papers (2022-02-26T19:10:48Z)
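The one-sample bookkeeping is easy to show in code. Below is a hedged sketch that pairs a classic Frank-Wolfe (conditional gradient) step over an l1 ball with a SAG table: each iteration draws a single index, refreshes that slot, and calls the linear minimization oracle on the running average. The step size, the oracle, and the least-squares example are illustrative; the paper's composite (smooth plus non-smooth) setting and its rates are not reproduced here.

```python
import numpy as np

def lmo_l1(g, radius=1.0):
    # Linear minimization oracle over the l1 ball: best signed coordinate vertex.
    s = np.zeros_like(g)
    i = np.argmax(np.abs(g))
    s[i] = -radius * np.sign(g[i])
    return s

def one_sample_sag_cgm(grad_i, N, dim, iters=3000, seed=0):
    rng = np.random.default_rng(seed)
    x = np.zeros(dim)
    table = np.zeros((N, dim))            # last seen gradient of each sample
    avg = np.zeros(dim)                   # running SAG average of the table
    for t in range(1, iters + 1):
        i = rng.integers(N)               # exactly one sample per iteration
        g = grad_i(x, i)
        avg += (g - table[i]) / N         # refresh the averaged estimator
        table[i] = g
        x += 2.0 / (t + 2.0) * (lmo_l1(avg) - x)   # classic Frank-Wolfe step
    return x

# Example: least squares restricted to the l1 ball, one row per iteration.
rng = np.random.default_rng(1)
A, b = rng.standard_normal((100, 20)), rng.standard_normal(100)
grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]
x = one_sample_sag_cgm(grad_i, N=100, dim=20)
print(0.5 * np.mean((A @ x - b) ** 2))
```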
- Improved Convergence Rates for Sparse Approximation Methods in Kernel-Based Learning [48.08663378234329]
Kernel-based models such as kernel ridge regression and Gaussian processes are ubiquitous in machine learning applications.
Existing sparse approximation methods can yield a significant reduction in the computational cost.
We provide novel confidence intervals for the Nyström method and the sparse variational Gaussian process approximation method.
arXiv Detail & Related papers (2022-02-08T17:22:09Z)
- Linear Time Kernel Matrix Approximation via Hyperspherical Harmonics [3.24890820102255]
We propose a new technique for constructing low-rank approximations of matrices that arise in kernel methods for machine learning.
Our approach pairs a novel automatically constructed analytic expansion of the underlying kernel function with a data-dependent compression step to further optimize the approximation.
Experimental results show our approach compares favorably to the commonly used Nyström method with respect to both accuracy for a given rank and computational time for a given accuracy, across a variety of kernels, dimensions, and datasets.
arXiv Detail & Related papers (2022-02-08T05:19:39Z)
- Reducing the Variance of Gaussian Process Hyperparameter Optimization with Preconditioning [54.01682318834995]
Preconditioning is a highly effective step for any iterative method involving matrix-vector multiplication.
We prove that preconditioning has an additional benefit that has been previously unexplored.
It can simultaneously reduce variance at essentially negligible cost.
arXiv Detail & Related papers (2021-07-01T06:43:11Z)
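The solver side of the entry above is easy to see in a preconditioned conjugate gradient loop. The sketch below counts CG iterations on a kernel system with and without a low-rank-plus-noise preconditioner; the preconditioner is built from exact top eigenpairs purely for illustration (practical codes would use pivoted Cholesky or Nyström factors), and the paper's actual contribution, variance reduction for the stochastic estimates used in hyperparameter optimization, is not shown.

```python
import numpy as np

def pcg(matvec, b, precond=lambda r: r, tol=1e-6, maxiter=500):
    """Preconditioned conjugate gradients; returns (solution, iterations used)."""
    x = np.zeros_like(b)
    r = b - matvec(x)
    z = precond(r)
    p = z.copy()
    rz = r @ z
    for k in range(1, maxiter + 1):
        Ap = matvec(p)
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            return x, k
        z = precond(r)
        rz, rz_old = r @ z, rz
        p = z + (rz / rz_old) * p
    return x, maxiter

# Kernel system (K + lam*I) x = b, solved with and without preconditioning.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 4))
K = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1) / 2.0)
lam = 1e-2
A = K + lam * np.eye(500)
b = rng.standard_normal(500)

# Illustrative preconditioner from the 50 dominant eigenpairs of K, applied
# in Woodbury form: (U diag(w) U^T + lam*I)^{-1}.
w, U = np.linalg.eigh(K)
w, U = w[-50:], U[:, -50:]
precond = lambda r: (r - U @ ((w / (w + lam)) * (U.T @ r))) / lam

x0, k0 = pcg(lambda v: A @ v, b)
x1, k1 = pcg(lambda v: A @ v, b, precond)
print(k0, "CG iterations ->", k1, "with preconditioning")
```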
- The Fast Kernel Transform [21.001203328543006]
We propose the Fast Kernel Transform (FKT), a general algorithm to compute matrix-vector multiplications for datasets in moderate dimensions with quasilinear complexity.
The FKT is easily applied to a broad class of kernels, including Gaussian, Matérn, and rational quadratic covariance functions and physically motivated Green's functions.
We illustrate the efficacy and versatility of the FKT by providing timing and accuracy benchmarks and by applying it to scale t-distributed stochastic neighbor embedding (t-SNE) and Gaussian processes to large real-world datasets.
arXiv Detail & Related papers (2021-06-08T16:15:47Z)
- Global Optimization of Gaussian processes [52.77024349608834]
We propose a reduced-space formulation with Gaussian processes trained on few data points.
The approach also leads to significantly smaller and computationally cheaper subproblems for lower bounding.
In total, the proposed method reduces the time to converge by orders of magnitude.
arXiv Detail & Related papers (2020-05-21T20:59:11Z)