Spectraformer: A Unified Random Feature Framework for Transformer
- URL: http://arxiv.org/abs/2405.15310v3
- Date: Wed, 23 Oct 2024 04:08:23 GMT
- Title: Spectraformer: A Unified Random Feature Framework for Transformer
- Authors: Duke Nguyen, Aditya Joshi, Flora Salim
- Abstract summary: We introduce Spectraformer, a unified framework for approximating and learning the kernel function in linearized attention of the Transformer.
Our empirical findings indicate that different kernels are good at different tasks and that kernel choice is fundamental to performant models.
- Score: 2.8514881296685113
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Linearization of attention using various kernel approximation and kernel learning techniques has shown promise. Past methods use a subset of combinations of component functions and weight matrices within the random features paradigm. We identify the need for a systematic comparison of different combinations of weight matrices and component functions for attention learning in Transformer. In this work, we introduce Spectraformer, a unified framework for approximating and learning the kernel function in linearized attention of the Transformer. We experiment with broad classes of component functions and weight matrices for three textual tasks in the LRA benchmark. Our empirical findings indicate that different kernels are good at different tasks and that kernel choice is fundamental to performant models. Our code is available at: https://github.com/dukenguyenxyz/spectraformer .
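To make the setting concrete, here is a minimal sketch of random-feature linearized attention, assuming one particular combination (positive exponential component functions with a Gaussian weight matrix, as in Performer-style attention); Spectraformer's contribution is precisely to systematize and compare many such combinations, so this is an illustration rather than the paper's implementation.

```python
import numpy as np

def positive_random_features(X, W):
    """Performer-style positive features for the softmax kernel:
    exp(q.k) = E_w[exp(w.q - ||q||^2/2) * exp(w.k - ||k||^2/2)], w ~ N(0, I)."""
    proj = X @ W.T
    sq = 0.5 * np.sum(X**2, axis=1, keepdims=True)
    return np.exp(proj - sq) / np.sqrt(W.shape[0])

def linear_attention(Q, K, V, W):
    """O(n m d) attention: aggregate keys/values once, then read per query."""
    pq, pk = positive_random_features(Q, W), positive_random_features(K, W)
    kv = pk.T @ V                       # (m, d_v): keys and values combined once
    z = pq @ pk.sum(axis=0)             # per-query normalizer
    return (pq @ kv) / z[:, None]

rng = np.random.default_rng(0)
n, d, m = 128, 16, 64
Q, K, V = rng.normal(size=(3, n, d)) / d**0.25
W = rng.normal(size=(m, d))             # Gaussian weight matrix: one of many choices
out = linear_attention(Q, K, V, W)      # (n, d), no n x n attention matrix formed
```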
Related papers
- Feature maps for the Laplacian kernel and its generalizations [3.671202973761375]
Unlike the Gaussian kernel, the Laplacian kernel is not separable.
We provide random features for the Laplacian kernel and its two generalizations.
We demonstrate the efficacy of these random feature maps on real datasets.
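For context, here is a minimal random Fourier feature sketch for the isotropic (L2) Laplacian kernel, not the paper's construction: by Bochner's theorem its spectral distribution is a multivariate Cauchy, which can be sampled as a Gaussian vector divided by the square root of a chi-square variable.

```python
import numpy as np

rng = np.random.default_rng(0)

def laplacian_rff(X, m=512, gamma=1.0):
    """Random Fourier features for k(x, y) = exp(-gamma * ||x - y||_2).
    The spectral density is multivariate Cauchy (Student-t with nu = 1)."""
    n, d = X.shape
    g = rng.normal(size=(m, d))
    u = rng.chisquare(df=1, size=(m, 1))
    W = gamma * g / np.sqrt(u)               # rows ~ scaled multivariate Cauchy
    proj = X @ W.T
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(m)

# Sanity check: feature dot products approximate the kernel as m grows.
X = rng.normal(size=(5, 3))
Phi = laplacian_rff(X, m=20000)
exact = np.exp(-np.linalg.norm(X[:, None] - X[None, :], axis=-1))
print(np.max(np.abs(Phi @ Phi.T - exact)))   # small approximation error
```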
arXiv Detail & Related papers (2025-02-21T16:36:20Z)
- New random projections for isotropic kernels using stable spectral distributions [0.0]
We decompose spectral kernel distributions as a scale mixture of α-stable random vectors.
Results have broad applications for support vector machines, kernel ridge regression, and other kernel-based machine learning techniques.
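A rough illustration of the idea, with parametrization details and normalizing constants glossed over (the `alpha_stable_frequencies` helper and its constants are assumptions, not the paper's algorithm): symmetric α-stable frequency vectors can be sampled as a Gaussian vector times a positive stable scale.

```python
import numpy as np
from scipy.stats import levy_stable

rng = np.random.default_rng(0)

def alpha_stable_frequencies(m, d, alpha=1.5):
    """Sample frequencies as a Gaussian scale mixture: w = sqrt(a) * g with
    a ~ positive (alpha/2)-stable (beta = 1) and g ~ N(0, I). This is the
    sub-Gaussian representation of symmetric alpha-stable vectors; exact
    scale constants are omitted for brevity."""
    a = levy_stable.rvs(alpha / 2.0, 1.0, size=(m, 1))
    g = rng.normal(size=(m, d))
    return np.sqrt(np.abs(a)) * g

W = alpha_stable_frequencies(m=256, d=8)
# Use W in any random Fourier feature map, e.g. [cos(X W^T), sin(X W^T)] / sqrt(m).
```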
arXiv Detail & Related papers (2024-11-05T03:28:01Z)
- Sample-efficient Bayesian Optimisation Using Known Invariances [56.34916328814857]
We show that vanilla and constrained BO algorithms are inefficient when optimising invariant objectives.
We derive a bound on the maximum information gain of these invariant kernels.
We use our method to design a current drive system for a nuclear fusion reactor, finding a high-performance solution.
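For intuition, one standard way to build an invariant kernel, shown here as an illustrative sketch rather than the paper's construction, is to average a base kernel over the known symmetry group:

```python
import numpy as np

def rbf(x, y, ell=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * ell**2))

def invariant_kernel(x, y, group, base=rbf):
    """k_G(x, y) = (1/|G|) sum_{g in G} base(x, g(y)); k_G inherits the group
    invariance, so the GP surrogate never has to re-learn known symmetries."""
    return float(np.mean([base(x, g(y)) for g in group]))

# Example: an objective known to be invariant under sign flips of the input.
group = [lambda v: v, lambda v: -v]
x, y = np.array([0.3, -1.2]), np.array([-0.3, 1.2])
print(invariant_kernel(x, y, group), invariant_kernel(x, -y, group))  # equal
```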
arXiv Detail & Related papers (2024-10-22T12:51:46Z)
- Variance-Reducing Couplings for Random Features [57.73648780299374]
Random features (RFs) are a popular technique to scale up kernel methods in machine learning.
We find couplings to improve RFs defined on both Euclidean and discrete input spaces.
We reach surprising conclusions about the benefits and limitations of variance reduction as a paradigm.
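One classic coupling on Euclidean inputs, given below as an illustrative sketch rather than the paper's couplings, is orthogonal random features: frequency directions within a block are coupled to be exactly orthogonal while each row keeps its Gaussian marginal, which typically lowers the variance of the kernel estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

def iid_frequencies(m, d):
    """Baseline: independent Gaussian frequencies."""
    return rng.normal(size=(m, d))

def orthogonal_frequencies(m, d):
    """Coupled frequencies: within each block, directions come from a Haar-
    orthogonal Q (via QR) and norms are redrawn as chi_d, so every row still
    has the N(0, I_d) marginal required for an unbiased estimator."""
    blocks = []
    for _ in range(-(-m // d)):                  # ceil(m / d) blocks
        Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
        radii = np.sqrt(rng.chisquare(df=d, size=(d, 1)))
        blocks.append(radii * Q)
    return np.vstack(blocks)[:m]

def rbf_estimate(x, y, W):
    """Unbiased RFF estimate of exp(-||x - y||^2 / 2)."""
    return np.mean(np.cos(W @ (x - y)))

x, y = rng.normal(size=(2, 8))
est_iid = [rbf_estimate(x, y, iid_frequencies(64, 8)) for _ in range(500)]
est_ort = [rbf_estimate(x, y, orthogonal_frequencies(64, 8)) for _ in range(500)]
print(np.var(est_iid), np.var(est_ort))   # orthogonal coupling: lower variance
```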
arXiv Detail & Related papers (2024-05-26T12:25:09Z)
- EulerFormer: Sequential User Behavior Modeling with Complex Vector Attention [88.45459681677369]
We propose a novel transformer variant with complex vector attention, named EulerFormer.
It provides a unified theoretical framework to formulate both semantic difference and positional difference.
It is more robust to semantic variations and possesses superior theoretical properties in principle.
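A loose sketch in the spirit of complex vector attention (not the paper's implementation): features are paired into complex numbers, rotated by a position-dependent phase via Euler's formula, and attention logits come from the real part of the complex inner product, so relative position enters as a phase difference alongside the semantic angle.

```python
import numpy as np

def to_complex(x):
    """Pair adjacent features: (n, 2h) real -> (n, h) complex, z = a + i b."""
    return x[:, 0::2] + 1j * x[:, 1::2]

def rotate_by_position(z, theta=0.1):
    """Euler-formula rotation e^{i * pos * theta}: position becomes phase."""
    pos = np.arange(z.shape[0])[:, None]
    return z * np.exp(1j * theta * pos)

def complex_attention_logits(q, k):
    """Re <q_i, conj(k_j)>: mixes semantic angle and relative position."""
    return np.real(q @ np.conj(k).T)

rng = np.random.default_rng(0)
Q, K = rng.normal(size=(2, 6, 8))       # 6 tokens, 8 real dims -> 4 complex dims
logits = complex_attention_logits(rotate_by_position(to_complex(Q)),
                                  rotate_by_position(to_complex(K)))
```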
arXiv Detail & Related papers (2024-03-26T14:18:43Z)
- Equivariant Scalar Fields for Molecular Docking with Fast Fourier Transforms [11.163940886337798]
We show how machine learning can learn a scoring function with a functional form that allows for more rapid optimization.
We benchmark our scoring functions on two simplified docking-related tasks: decoy pose scoring and rigid conformer docking.
Our method attains similar but faster performance on crystal structures compared to the widely-used Vina and Gnina scoring functions, and is more robust on predicted structures.
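The speed-up rests on a standard trick, sketched below under simplifying assumptions (a 1-D periodic grid and a plain cross-correlation score): the convolution theorem scores all relative translations of two scalar fields at once in O(N log N) instead of O(N^2).

```python
import numpy as np

def scores_over_translations(receptor, ligand):
    """score[t] = sum_x receptor[x] * ligand[(x + t) % N] for every shift t,
    all computed at once via the cross-correlation theorem."""
    R, L = np.fft.fft(receptor), np.fft.fft(ligand)
    return np.fft.ifft(np.conj(R) * L).real

rng = np.random.default_rng(0)
receptor, ligand = rng.normal(size=(2, 256))    # toy 1-D scalar fields
s = scores_over_translations(receptor, ligand)  # O(N log N), not O(N^2)
best_shift = int(np.argmax(s))

t = 7                                           # brute-force check for one shift
assert np.isclose(s[t], np.sum(receptor * np.roll(ligand, -t)))
```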
arXiv Detail & Related papers (2023-12-07T14:32:32Z)
- On the Identifiability and Interpretability of Gaussian Process Models [8.417178903130244]
We critically examine the prevalent practice of using additive mixtures of Matérn kernels in single-output Gaussian process (GP) models.
We show that the smoothness of a mixture of Matérn kernels is determined by the least smooth component and that a GP with such a kernel is effectively equivalent to the least smooth kernel component.
We show that $A$ is identifiable up to a multiplicative constant, suggesting that multiplicative mixtures are well suited for multi-output tasks.
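The smoothness claim is easy to verify numerically; the sketch below (illustrative, not the paper's code) draws a sample from a GP with an additive Matérn-1/2 + Matérn-5/2 kernel, whose roughness is governed by the Matérn-1/2 term.

```python
import numpy as np

def matern12(r, ell=1.0):
    """Matern nu=1/2 (exponential) kernel: nowhere-differentiable sample paths."""
    return np.exp(-r / ell)

def matern52(r, ell=1.0):
    """Matern nu=5/2 kernel: twice-differentiable sample paths."""
    c = np.sqrt(5.0) * r / ell
    return (1.0 + c + c**2 / 3.0) * np.exp(-c)

x = np.linspace(0.0, 5.0, 300)
r = np.abs(x[:, None] - x[None, :])
K_mix = matern12(r) + matern52(r)          # additive mixture of Matern kernels

rng = np.random.default_rng(0)
f = rng.multivariate_normal(np.zeros(len(x)), K_mix + 1e-8 * np.eye(len(x)))
# f is as rough as a pure Matern-1/2 draw: the least smooth component dominates.
```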
arXiv Detail & Related papers (2023-10-25T22:00:29Z)
- Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection [88.23337313766353]
This work first provides a comprehensive statistical theory for transformers to perform ICL.
We show that transformers can implement a broad class of standard machine learning algorithms in context.
A single transformer can adaptively select different base ICL algorithms.
arXiv Detail & Related papers (2023-06-07T17:59:31Z)
- Doubly Deformable Aggregation of Covariance Matrices for Few-shot Segmentation [25.387090319723715]
Training semantic segmentation models with few annotated samples has great potential in various real-world applications.
For the few-shot segmentation task, the main challenge is how to accurately measure the semantic correspondence between the support and query samples.
We propose to aggregate the learnable covariance matrices with a deformable 4D Transformer to effectively predict the segmentation map.
arXiv Detail & Related papers (2022-07-30T20:41:38Z)
- Transformer with Fourier Integral Attentions [18.031977028559282]
We propose FourierFormers, a new class of transformers in which the dot-product kernels are replaced by novel generalized Fourier integral kernels.
Compared to the conventional transformers with dot-product attention, FourierFormers attain better accuracy and reduce the redundancy between attention heads.
We empirically corroborate the advantages of FourierFormers over the baseline transformers in a variety of practical applications including language modeling and image classification.
arXiv Detail & Related papers (2022-06-01T03:06:21Z)
- Geometry-aware Bayesian Optimization in Robotics using Riemannian Matérn Kernels [64.62221198500467]
We show how to implement geometry-aware kernels for Bayesian optimization.
This technique can be used for control parameter tuning, parametric policy adaptation, and structure design in robotics.
arXiv Detail & Related papers (2021-11-02T09:47:22Z)
- On Learning the Transformer Kernel [13.955526058823166]
KERNELIZED TRANSFORMER is a generic, scalable, data-driven framework for learning the kernel function in Transformers.
Our framework approximates the Transformer kernel as a dot product between spectral feature maps and learns the kernel by learning the spectral distribution.
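A minimal PyTorch sketch of that idea, assuming a trigonometric feature map and a reparameterized Gaussian spectral distribution (the paper considers richer spectral families): the kernel is phi(q)·phi(k), and backpropagation trains the spectral parameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableSpectralFeatures(nn.Module):
    """phi(x) = [cos(x W^T), sin(x W^T)] / sqrt(m), with frequencies
    W = mu + softplus(rho) * eps (reparameterization trick, fixed base noise).
    Training (mu, rho) trains the spectral distribution, hence the kernel."""
    def __init__(self, d, m):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(m, d))
        self.rho = nn.Parameter(torch.zeros(m, d))
        self.register_buffer("eps", torch.randn(m, d))

    def forward(self, x):
        W = self.mu + F.softplus(self.rho) * self.eps
        proj = x @ W.t()
        return torch.cat([proj.cos(), proj.sin()], dim=-1) / W.shape[0] ** 0.5

phi = LearnableSpectralFeatures(d=16, m=64)
q, k = torch.randn(8, 16), torch.randn(8, 16)
scores = phi(q) @ phi(k).t()   # (8, 8) learned-kernel attention scores
scores.sum().backward()        # any downstream loss updates mu and rho
```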
arXiv Detail & Related papers (2021-10-15T19:20:25Z)
- Kernel Continual Learning [117.79080100313722]
Kernel continual learning is a simple but effective variant of continual learning that tackles catastrophic forgetting.
An episodic memory unit stores a subset of samples for each task, from which task-specific classifiers are learned based on kernel ridge regression.
Variational random features are used to learn a data-driven kernel for each task.
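A bare-bones sketch of the kernel ridge regression step (memory selection and the variational kernel are elided; the RBF kernel here is a placeholder): each task's classifier has a closed form on its episodic memory.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * d2)

def fit_task_classifier(memory_X, memory_Y, lam=0.1):
    """Closed-form KRR on a task's episodic memory: alpha = (K + lam I)^{-1} Y."""
    K = rbf_kernel(memory_X, memory_X)
    alpha = np.linalg.solve(K + lam * np.eye(len(memory_X)), memory_Y)
    return lambda X_new: rbf_kernel(X_new, memory_X) @ alpha

rng = np.random.default_rng(0)
X_mem = rng.normal(size=(20, 5))
Y_mem = np.eye(3)[rng.integers(0, 3, size=20)]   # one-hot labels, 3 classes
predict = fit_task_classifier(X_mem, Y_mem)
scores = predict(rng.normal(size=(4, 5)))        # (4, 3); argmax gives the class
```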
arXiv Detail & Related papers (2021-07-12T22:09:30Z)
- Kernel Identification Through Transformers [54.3795894579111]
Kernel selection plays a central role in determining the performance of Gaussian Process (GP) models.
This work addresses the challenge of constructing custom kernel functions for high-dimensional GP regression models.
We introduce a novel approach named KITT: Kernel Identification Through Transformers.
arXiv Detail & Related papers (2021-06-15T14:32:38Z)
- MetaKernel: Learning Variational Random Features with Limited Labels [120.90737681252594]
Few-shot learning deals with the fundamental and challenging problem of learning from a few annotated samples, while being able to generalize well on new tasks.
We propose meta-learning kernels with random Fourier features for few-shot learning, which we call MetaKernel.
arXiv Detail & Related papers (2021-05-08T21:24:09Z)
- Function Approximation via Sparse Random Features [23.325877475827337]
This paper introduces the sparse random feature method that learns parsimonious random feature models utilizing techniques from compressive sensing.
We show that the sparse random feature method outperforms shallow networks for well-structured functions and in applications to scientific machine learning tasks.
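The recipe admits a compact sketch, assuming scikit-learn's Lasso as a stand-in for the paper's compressive-sensing solver: generate an overcomplete set of random features, then let l1 regularization keep only a few.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Target with low-dimensional structure: depends only on x[0] and x[1].
X = rng.uniform(-1, 1, size=(200, 10))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2

# Overcomplete random Fourier features.
m = 1000
W = rng.normal(scale=2.0, size=(m, X.shape[1]))
b = rng.uniform(0, 2 * np.pi, size=m)
Phi = np.cos(X @ W.T + b)

# l1 regression keeps a parsimonious subset of the random features.
model = Lasso(alpha=0.01).fit(Phi, y)
print("active features:", np.sum(model.coef_ != 0), "of", m)
```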
arXiv Detail & Related papers (2021-03-04T17:53:54Z)
- Learning to Learn Kernels with Variational Random Features [118.09565227041844]
We introduce kernels with random Fourier features in the meta-learning framework to leverage their strong few-shot learning ability.
We formulate the optimization of MetaVRF as a variational inference problem.
We show that MetaVRF delivers much better, or at least competitive, performance compared to existing meta-learning alternatives.
arXiv Detail & Related papers (2020-06-11T18:05:29Z)
- Spectral Learning on Matrices and Tensors [74.88243719463053]
We show that tensor decomposition can pick up latent effects that are missed by matrix methods.
We also outline computational techniques to design efficient tensor decomposition methods.
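As a quick illustration of the tensor-versus-matrix point (assuming the tensorly library; not the paper's code): a CP decomposition of a 3-way tensor recovers latent rank-one components that flattening into a matrix would mix together.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

rng = np.random.default_rng(0)

# Build a rank-2 tensor T = sum_r a_r (x) b_r (x) c_r from known factors.
A, B, C = rng.normal(size=(4, 2)), rng.normal(size=(5, 2)), rng.normal(size=(6, 2))
T = np.einsum('ir,jr,kr->ijk', A, B, C)

# CP/PARAFAC recovers the latent components (up to permutation and scaling).
weights, factors = parafac(tl.tensor(T), rank=2)
T_hat = tl.cp_to_tensor((weights, factors))
print(np.linalg.norm(T - tl.to_numpy(T_hat)) / np.linalg.norm(T))  # ~0
```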
arXiv Detail & Related papers (2020-04-16T22:53:00Z)
- Scaling up Kernel Ridge Regression via Locality Sensitive Hashing [6.704115928005158]
We introduce a weighted version of random binning features and show that the corresponding kernel function generates smooth Gaussian processes.
We show that our weighted random binning features provide a spectral approximation to the corresponding kernel matrix, leading to efficient algorithms for kernel ridge regression.
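For background, here is a sketch of classic random binning features with an illustrative uniform per-repeat weight (the paper's weighting scheme is its contribution and is not reproduced here): each repeat draws a random grid, and two points share a feature exactly when they fall in the same cell.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_binning(X, p=50):
    """p random grids; phi(x) concatenates one-hot indicators of the cell that
    x lands in for each grid, scaled by the square root of a per-repeat weight."""
    n, d = X.shape
    rows, cols, vals = [], [], []
    col_index, next_col = {}, 0
    weights = np.ones(p) / p             # uniform here; weighted variants vary these
    for j in range(p):
        delta = rng.gamma(shape=2.0, scale=1.0, size=d)   # random pitch per dim
        shift = rng.uniform(0.0, delta)                   # random grid offset
        bins = np.floor((X - shift) / delta).astype(int)  # (n, d) cell coordinates
        for i in range(n):
            key = (j, tuple(bins[i]))
            if key not in col_index:
                col_index[key] = next_col
                next_col += 1
            rows.append(i)
            cols.append(col_index[key])
            vals.append(np.sqrt(weights[j]))
    Phi = np.zeros((n, next_col))
    Phi[rows, cols] = vals
    return Phi

X = rng.uniform(0.0, 4.0, size=(100, 2))
Phi = random_binning(X)
K_hat = Phi @ Phi.T    # entries near 1 for nearby points, near 0 for distant ones
```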
arXiv Detail & Related papers (2020-03-21T21:41:16Z)