Skyformer: Remodel Self-Attention with Gaussian Kernel and Nyström Method
- URL: http://arxiv.org/abs/2111.00035v1
- Date: Fri, 29 Oct 2021 18:28:49 GMT
- Title: Skyformer: Remodel Self-Attention with Gaussian Kernel and Nyström Method
- Authors: Yifan Chen, Qi Zeng, Heng Ji, Yun Yang
- Abstract summary: We introduce Skyformer, which replaces the softmax structure with a Gaussian kernel to stabilize model training and adapts the Nyström method to accelerate the computation.
Experiments on the Long Range Arena benchmark show that the proposed method achieves performance comparable to or even better than full self-attention.
- Score: 35.62926659320816
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers are expensive to train due to the quadratic time and space
complexity in the self-attention mechanism. On the other hand, although kernel
machines suffer from the same computation bottleneck in pairwise dot products,
several approximation schemes have been successfully incorporated to
considerably reduce their computational cost without sacrificing too much
accuracy. In this work, we leverage the computation methods for kernel machines
to alleviate the high computational cost and introduce Skyformer, which
replaces the softmax structure with a Gaussian kernel to stabilize the model
training and adapts the Nyström method to a non-positive semidefinite matrix
to accelerate the computation. We further conduct theoretical analysis by
showing that the matrix approximation error of our proposed method is small in
the spectral norm. Experiments on Long Range Arena benchmark show that the
proposed method is sufficient in getting comparable or even better performance
than the full self-attention while requiring fewer computation resources.
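The abstract combines two ingredients: Gaussian-kernel scores in place of softmax, and a Nyström approximation for sub-quadratic cost. The sketch below is a minimal numpy illustration of that combination, not the authors' exact algorithm: landmarks are sampled uniformly from the stacked queries and keys, the standard Nyström cross-approximation kappa(Q, K) ≈ kappa(Q, Z) kappa(Z, Z)^+ kappa(Z, K) is applied to V right-to-left, Skyformer's normalization and its handling of the non-PSD case are omitted, and the bandwidth sigma and all function names are illustrative choices. The final print reports the relative spectral-norm error, the quantity the abstract's analysis concerns.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    # Pairwise ||x - y||^2 via the expansion ||x||^2 + ||y||^2 - 2 x.y,
    # clipped at zero to guard against negative round-off.
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def nystrom_kernel_attention(Q, K, V, m=64, sigma=1.0, seed=0):
    """Approximate gaussian_kernel(Q, K) @ V with O(n*m) kernel evaluations."""
    rng = np.random.default_rng(seed)
    P = np.concatenate([Q, K], axis=0)               # landmarks from both sides
    Z = P[rng.choice(len(P), size=m, replace=False)]
    C_q = gaussian_kernel(Q, Z, sigma)               # n x m
    W = gaussian_kernel(Z, Z, sigma)                 # m x m
    C_k = gaussian_kernel(Z, K, sigma)               # m x n
    # kappa(Q, K) ~ C_q @ pinv(W) @ C_k; evaluate right-to-left so the
    # full n x n attention matrix is never materialized.
    return C_q @ np.linalg.lstsq(W, C_k @ V, rcond=None)[0]

n, d = 512, 32
rng = np.random.default_rng(1)
Q, K, V = (0.3 * rng.standard_normal((n, d)) for _ in range(3))
exact = gaussian_kernel(Q, K) @ V
approx = nystrom_kernel_attention(Q, K, V, m=128)
print(np.linalg.norm(exact - approx, 2) / np.linalg.norm(exact, 2))
```

For fixed m, time and memory are linear in the sequence length n, and note that the softmax row normalization is absent entirely, matching the abstract's "replaces the softmax structure" rather than approximating it.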
Related papers
- Stochastic Optimization for Non-convex Problem with Inexact Hessian
Matrix, Gradient, and Function [99.31457740916815]
Trust-region (TR) methods and adaptive regularization using cubics (ARC) have proven to have very appealing theoretical properties.
We show that TR and ARC methods can simultaneously accommodate inexact computations of the Hessian, gradient, and function values.
arXiv Detail & Related papers (2023-10-18T10:29:58Z)
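As a toy illustration of the entry above, here is a textbook trust-region loop (Cauchy-point step, standard rho-test) driven by a row-subsampled, hence inexact, Hessian. The test problem, constants, and function names are made up for the example; nothing here reproduces the paper's ARC variant or its theory.

```python
import numpy as np

def cauchy_point(g, H, delta):
    # Closed-form minimizer of the quadratic TR model along -g within radius delta.
    gHg = g @ H @ g
    tau = 1.0 if gHg <= 0 else min(1.0, np.linalg.norm(g) ** 3 / (delta * gHg))
    return -(tau * delta / np.linalg.norm(g)) * g

def tr_minimize(f, grad, hess, x, delta=1.0, iters=100, eta=0.1):
    for _ in range(iters):
        g, H = grad(x), hess(x)                   # hess may return a noisy estimate
        if np.linalg.norm(g) < 1e-12:
            return x
        p = cauchy_point(g, H, delta)
        pred = -(g @ p + 0.5 * p @ H @ p)         # decrease predicted by the model
        rho = (f(x) - f(x + p)) / max(pred, 1e-12)
        if rho > eta:                             # accept only sufficiently good steps
            x = x + p
        delta = 2.0 * delta if rho > 0.75 else 0.5 * delta  # simplistic radius rule
    return x

# Least-squares test problem; the Hessian A^T A is estimated from subsampled rows.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((200, 10)), rng.standard_normal(200)
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)

def hess(x, m=50):
    idx = rng.choice(200, size=m, replace=False)  # inexact, subsampled Hessian
    return (200 / m) * A[idx].T @ A[idx]

print(f(np.zeros(10)), "->", f(tr_minimize(f, grad, hess, np.zeros(10))))
```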
- Reconstructing Kernel-based Machine Learning Force Fields with Super-linear Convergence [0.18416014644193063]
We consider the broad class of Nyström-type methods for constructing preconditioners.
All considered methods aim to identify a representative subset of inducing (kernel) columns to approximate the dominant kernel spectrum.
arXiv Detail & Related papers (2022-12-24T13:45:50Z)
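To make "inducing columns as a preconditioner" concrete, here is a hedged numpy sketch: it takes m columns of a kernel matrix, forms the Nyström approximation K_hat = C W^+ C^T, and applies (K_hat + lam*I)^{-1} via the Woodbury identity so it can serve as a preconditioner for an iterative solver. Column selection here is plain uniform sampling, whereas the paper studies more careful schemes; all names are illustrative.

```python
import numpy as np

def nystrom_preconditioner(K, idx, lam):
    """Return x -> (K_hat + lam*I)^{-1} x, where K_hat = C W^+ C^T is the
    Nystrom approximation built from the inducing columns K[:, idx]."""
    C = K[:, idx]                                        # n x m inducing columns
    W = C[idx, :]                                        # m x m block
    L = np.linalg.cholesky(W + 1e-8 * np.eye(len(idx)))  # jitter for stability
    U = np.linalg.solve(L, C.T).T                        # K_hat = U @ U.T
    M = lam * np.eye(U.shape[1]) + U.T @ U
    # Woodbury: (U U^T + lam*I)^{-1} = (I - U M^{-1} U^T) / lam, O(n*m) to apply
    return lambda x: (x - U @ np.linalg.solve(M, U.T @ x)) / lam

# Demo on a Gaussian kernel matrix: conditioning before and after.
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 5))
K = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1) / 2.0)
lam = 1e-2
P = nystrom_preconditioner(K, rng.choice(300, size=40, replace=False), lam)
A = K + lam * np.eye(300)
PA = np.stack([P(col) for col in A.T]).T                 # dense P @ A, inspection only
print(np.linalg.cond(A), "->", np.linalg.cond(PA))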
- Linear Self-Attention Approximation via Trainable Feedforward Kernel [77.34726150561087]
In pursuit of faster computation, Efficient Transformers demonstrate an impressive variety of approaches.
We aim to expand the idea of trainable kernel methods to approximate the self-attention mechanism of the Transformer architecture.
arXiv Detail & Related papers (2022-11-08T08:14:11Z)
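A common way to realize the idea in the entry above is "linearized" attention: pass queries and keys through a feature map phi and exploit associativity, phi(Q) (phi(K)^T V), so the cost grows linearly in sequence length. The sketch below uses a fixed random feedforward map with an elu+1 nonlinearity, a common choice in the linearized-attention literature; the specific map is an assumption for illustration, and in the cited work its weights would be trained.

```python
import numpy as np

def phi(X, W):
    """Feedforward feature map; elu(x) + 1 keeps features positive so the
    implicit attention weights stay nonnegative."""
    H = X @ W
    return np.where(H > 0, H + 1.0, np.exp(np.minimum(H, 0.0)))

def linear_attention(Q, K, V, W):
    Qf, Kf = phi(Q, W), phi(K, W)     # n x r features, r independent of n
    KV = Kf.T @ V                     # r x d summary of keys and values
    Z = Qf @ Kf.sum(axis=0)           # per-query normalizer (row sums)
    return (Qf @ KV) / Z[:, None]     # total cost O(n * r * d), linear in n

rng = np.random.default_rng(0)
n, d, r = 1024, 64, 128
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
W = rng.standard_normal((d, r)) / d ** 0.5   # fixed here; trainable in the paper
out = linear_attention(Q, K, V, W)
print(out.shape)
```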
- Faster One-Sample Stochastic Conditional Gradient Method for Composite Convex Minimization [61.26619639722804]
We propose a conditional gradient method (CGM) for minimizing convex finite-sum objectives formed as a sum of smooth and non-smooth terms.
The proposed method, equipped with a stochastic average gradient (SAG) estimator, requires only one sample per iteration, yet guarantees fast convergence rates on par with more sophisticated variance reduction techniques.
arXiv Detail & Related papers (2022-02-26T19:10:48Z)
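The one-sample bookkeeping is easy to show in code. Below is a hedged sketch that pairs a classic Frank-Wolfe (conditional gradient) step over an l1 ball with a SAG table: each iteration draws a single index, refreshes that slot, and calls the linear minimization oracle on the running average. The step size, the oracle, and the least-squares example are illustrative; the paper's composite (smooth plus non-smooth) setting and its rates are not reproduced here.

```python
import numpy as np

def lmo_l1(g, radius=1.0):
    # Linear minimization oracle over the l1 ball: best signed coordinate vertex.
    s = np.zeros_like(g)
    i = np.argmax(np.abs(g))
    s[i] = -radius * np.sign(g[i])
    return s

def one_sample_sag_cgm(grad_i, N, dim, iters=3000, seed=0):
    rng = np.random.default_rng(seed)
    x = np.zeros(dim)
    table = np.zeros((N, dim))            # last seen gradient of each sample
    avg = np.zeros(dim)                   # running SAG average of the table
    for t in range(1, iters + 1):
        i = rng.integers(N)               # exactly one sample per iteration
        g = grad_i(x, i)
        avg += (g - table[i]) / N         # refresh the averaged estimator
        table[i] = g
        x += 2.0 / (t + 2.0) * (lmo_l1(avg) - x)   # classic Frank-Wolfe step
    return x

# Example: least squares restricted to the l1 ball, one row per iteration.
rng = np.random.default_rng(1)
A, b = rng.standard_normal((100, 20)), rng.standard_normal(100)
grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]
x = one_sample_sag_cgm(grad_i, N=100, dim=20)
print(0.5 * np.mean((A @ x - b) ** 2))
```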
- Improved Convergence Rates for Sparse Approximation Methods in Kernel-Based Learning [48.08663378234329]
Kernel-based models such as kernel ridge regression and Gaussian processes are ubiquitous in machine learning applications.
Existing sparse approximation methods can yield a significant reduction in the computational cost.
We provide novel confidence intervals for the Nyström method and the sparse variational Gaussian process approximation method.
arXiv Detail & Related papers (2022-02-08T17:22:09Z)
- Linear Time Kernel Matrix Approximation via Hyperspherical Harmonics [3.24890820102255]
We propose a new technique for constructing low-rank approximations of matrices that arise in kernel methods for machine learning.
Our approach pairs a novel automatically constructed analytic expansion of the underlying kernel function with a data-dependent compression step to further optimize the approximation.
Experimental results show our approach compares favorably to the commonly used Nyström method with respect to both accuracy for a given rank and computational time for a given accuracy, across a variety of kernels, dimensions, and datasets.
arXiv Detail & Related papers (2022-02-08T05:19:39Z)
- Reducing the Variance of Gaussian Process Hyperparameter Optimization with Preconditioning [54.01682318834995]
Preconditioning is a highly effective step for any iterative method involving matrix-vector multiplication.
We prove that preconditioning has an additional benefit that has been previously unexplored.
It can simultaneously reduce variance at essentially negligible cost.
arXiv Detail & Related papers (2021-07-01T06:43:11Z)
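The solver side of the entry above is easy to see in a preconditioned conjugate gradient loop. The sketch below counts CG iterations on a kernel system with and without a low-rank-plus-noise preconditioner; the preconditioner is built from exact top eigenpairs purely for illustration (practical codes would use pivoted Cholesky or Nyström factors), and the paper's actual contribution, variance reduction for the stochastic estimates used in hyperparameter optimization, is not shown.

```python
import numpy as np

def pcg(matvec, b, precond=lambda r: r, tol=1e-6, maxiter=500):
    """Preconditioned conjugate gradients; returns (solution, iterations used)."""
    x = np.zeros_like(b)
    r = b - matvec(x)
    z = precond(r)
    p = z.copy()
    rz = r @ z
    for k in range(1, maxiter + 1):
        Ap = matvec(p)
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            return x, k
        z = precond(r)
        rz, rz_old = r @ z, rz
        p = z + (rz / rz_old) * p
    return x, maxiter

# Kernel system (K + lam*I) x = b, solved with and without preconditioning.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 4))
K = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1) / 2.0)
lam = 1e-2
A = K + lam * np.eye(500)
b = rng.standard_normal(500)

# Illustrative preconditioner from the 50 dominant eigenpairs of K, applied
# in Woodbury form: (U diag(w) U^T + lam*I)^{-1}.
w, U = np.linalg.eigh(K)
w, U = w[-50:], U[:, -50:]
precond = lambda r: (r - U @ ((w / (w + lam)) * (U.T @ r))) / lam

x0, k0 = pcg(lambda v: A @ v, b)
x1, k1 = pcg(lambda v: A @ v, b, precond)
print(k0, "CG iterations ->", k1, "with preconditioning")
```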
- The Fast Kernel Transform [21.001203328543006]
We propose the Fast Kernel Transform (FKT), a general algorithm to compute matrix-vector multiplications for datasets in moderate dimensions with quasilinear complexity.
The FKT is easily applied to a broad class of kernels, including Gaussian, Matérn, and rational quadratic covariance functions and physically motivated Green's functions.
We illustrate the efficacy and versatility of the FKT by providing timing and accuracy benchmarks and by applying it to scale t-distributed stochastic neighbor embedding (t-SNE) and Gaussian processes to large real-world datasets.
arXiv Detail & Related papers (2021-06-08T16:15:47Z)
- Global Optimization of Gaussian processes [52.77024349608834]
We propose a reduced-space formulation with Gaussian processes trained on few data points.
The approach also leads to significantly smaller and computationally cheaper subproblems for lower bounding.
In total, the proposed method reduces the time to converge by orders of magnitude.
arXiv Detail & Related papers (2020-05-21T20:59:11Z)