Linear Self-Attention Approximation via Trainable Feedforward Kernel
- URL: http://arxiv.org/abs/2211.04076v1
- Date: Tue, 8 Nov 2022 08:14:11 GMT
- Title: Linear Self-Attention Approximation via Trainable Feedforward Kernel
- Authors: Uladzislau Yorsh and Alexander Kovalenko
- Abstract summary: In pursuit of faster computation, Efficient Transformers demonstrate an impressive variety of approaches.
We aim to expand the idea of trainable kernel methods to approximate the self-attention mechanism of the Transformer architecture.
- Score: 77.34726150561087
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In pursuit of faster computation, Efficient Transformers demonstrate an
impressive variety of approaches -- models attaining sub-quadratic attention
complexity can utilize a notion of sparsity or a low-rank approximation of
inputs to reduce the number of attended keys; other ways to reduce complexity
include locality-sensitive hashing, key pooling, additional memory to store
information in a compacted form, or hybridization with other architectures such as
CNNs. Often built on a strong mathematical foundation, kernelized approaches allow
for the approximation of attention with linear complexity while retaining high
accuracy. Therefore, in the present paper, we aim to expand the idea of
trainable kernel methods to approximate the self-attention mechanism of the
Transformer architecture.
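To make the kernelized idea concrete, here is a minimal sketch of linear attention with a trainable feedforward kernel: a small learned feature map phi replaces the softmax kernel, and reassociating the matrix products makes the cost linear in sequence length. The module layout, hidden width, and ReLU feature map below are illustrative assumptions, not the exact parameterization studied in the paper.

```python
import torch
import torch.nn as nn


class TrainableKernelLinearAttention(nn.Module):
    """Kernelized self-attention: softmax(Q K^T) V is approximated by
    phi(Q) (phi(K)^T V), where phi is a small trainable feedforward map.
    Computing phi(K)^T V first gives O(n * r * d) cost instead of O(n^2 * d)."""

    def __init__(self, dim: int, feature_dim: int = 64):
        super().__init__()
        self.to_qkv = nn.Linear(dim, 3 * dim, bias=False)
        # Trainable feedforward kernel feature map (an illustrative choice).
        self.phi = nn.Sequential(
            nn.Linear(dim, feature_dim),
            nn.ReLU(),  # non-negative features act like non-negative attention scores
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        q, k = self.phi(q), self.phi(k)                     # (b, n, r)
        kv = torch.einsum("bnr,bnd->brd", k, v)             # sum over n once: linear in n
        norm = torch.einsum("bnr,br->bn", q, k.sum(dim=1))  # row-wise normalizer
        return torch.einsum("bnr,brd->bnd", q, kv) / (norm.unsqueeze(-1) + 1e-6)


# usage: a 1024-token sequence processed at linear (not quadratic) attention cost
attn = TrainableKernelLinearAttention(dim=128)
y = attn(torch.randn(2, 1024, 128))  # -> shape (2, 1024, 128)
```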
Related papers
- Center-Sensitive Kernel Optimization for Efficient On-Device Incremental Learning [88.78080749909665]
Current on-device training methods focus on efficient training without considering catastrophic forgetting.
This paper proposes a simple but effective edge-friendly incremental learning framework.
Our method achieves an average accuracy boost of 38.08% with even less memory and approximate computation.
arXiv Detail & Related papers (2024-06-13T05:49:29Z)
- Fast Dual Subgradient Optimization of the Integrated Transportation Distance Between Stochastic Kernels [1.5229257192293204]
A generalization of the Wasserstein metric, the integrated transportation distance, establishes a novel distance between probability kernels of Markov systems.
This metric serves as the foundation for an efficient approximation technique, enabling the original system's kernel to be replaced by one with discrete support of limited cardinality.
We present a specialized dual algorithm capable of constructing these approximate kernels quickly and efficiently, without requiring computationally expensive matrix operations.
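As a hedged sketch of the object being approximated: for kernels $Q, \tilde{Q} : \mathcal{X} \to \mathcal{P}(\mathcal{X})$ of a Markov system and a reference marginal $\lambda$ on $\mathcal{X}$, a metric of this kind can be written by integrating a Wasserstein distance between the conditional distributions (the paper's exact definition and weighting may differ):

$$
\bar{d}_{\lambda}(Q, \tilde{Q}) \;=\; \int_{\mathcal{X}} W_1\big(Q(x, \cdot),\, \tilde{Q}(x, \cdot)\big)\, \lambda(\mathrm{d}x).
$$

The approximation task is then to choose each $\tilde{Q}(x, \cdot)$ supported on a small finite set while keeping $\bar{d}_{\lambda}$ small, which is what the dual subgradient algorithm constructs efficiently.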
arXiv Detail & Related papers (2023-12-03T15:44:17Z)
- Dynamic Kernel-Based Adaptive Spatial Aggregation for Learned Image Compression [63.56922682378755]
We focus on extending spatial aggregation capability and propose a dynamic kernel-based transform coding.
The proposed adaptive aggregation generates kernel offsets to capture valid information within a content-conditioned range and thereby aid the transform.
Experimental results demonstrate that our method achieves superior rate-distortion performance on three benchmarks compared to the state-of-the-art learning-based methods.
arXiv Detail & Related papers (2023-08-17T01:34:51Z)
- Reconstructing Kernel-based Machine Learning Force Fields with Super-linear Convergence [0.18416014644193063]
We consider the broad class of Nyström-type methods to construct preconditioners.
All considered methods aim to identify a representative subset of inducing (kernel) columns to approximate the dominant kernel spectrum.
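For context, the generic Nyström construction these methods build on selects an index set $S$ of $m \ll N$ columns and approximates the full kernel matrix from them (the formula below is the standard form; how $S$ is selected and how the preconditioner is regularized is where the considered methods differ):

$$
K \;\approx\; \hat{K} \;=\; K_{:,S}\, K_{S,S}^{+}\, K_{S,:}, \qquad K \in \mathbb{R}^{N \times N},\ |S| = m,
$$

where $K_{S,S}^{+}$ is the pseudo-inverse of the $m \times m$ submatrix. Since $\hat{K} + \mu I$ is a low-rank update of the identity, it can be inverted cheaply via the Woodbury identity and used as a preconditioner.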
arXiv Detail & Related papers (2022-12-24T13:45:50Z)
- A Recursively Recurrent Neural Network (R2N2) Architecture for Learning Iterative Algorithms [64.3064050603721]
We generalize the Runge-Kutta neural network to a recursively recurrent neural network (R2N2) superstructure for the design of customized iterative algorithms.
We demonstrate that regular training of the weight parameters inside the proposed superstructure on input/output data of various computational problem classes yields similar iterations to Krylov solvers for linear equation systems, Newton-Krylov solvers for nonlinear equation systems, and Runge-Kutta solvers for ordinary differential equations.
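For reference, the recurrence such a superstructure generalizes is the explicit Runge-Kutta step, in which each stage is computed from the previous stages and the stages are then recombined linearly (this is standard numerics, not a detail specific to the paper):

$$
k_i = f\Big(t + c_i h,\; y + h \sum_{j<i} a_{ij} k_j\Big), \qquad
y^{+} = y + h \sum_{i=1}^{s} b_i k_i .
$$

Roughly speaking, making the coefficients $a_{ij}, b_i, c_i$ and the stage structure trainable, and applying the step recurrently, is the flavor of customized iterative algorithm the superstructure is meant to express.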
arXiv Detail & Related papers (2022-11-22T16:30:33Z)
- Semi-Parametric Inducing Point Networks and Neural Processes [15.948270454686197]
Semi-parametric inducing point networks (SPIN) can query the training set at inference time in a compute-efficient manner.
SPIN attains linear complexity via a cross-attention mechanism between datapoints inspired by inducing point methods.
In our experiments, SPIN reduces memory requirements, improves accuracy across a range of meta-learning tasks, and improves state-of-the-art performance on an important practical problem, genotype imputation.
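A minimal sketch of the inducing-point cross-attention pattern behind that linear complexity is given below: the $n$ datapoints interact only through $m \ll n$ learned inducing vectors, so each attention call costs $O(nm)$ rather than $O(n^2)$. This is a generic rendering of the idea, not SPIN's exact architecture; the layer names and sizes are assumptions.

```python
import torch
import torch.nn as nn


class InducingCrossAttention(nn.Module):
    """Cross-attention routed through m learned inducing points.
    Both attention calls cost O(n * m), so the block is linear in the
    number of datapoints n (a generic sketch, not SPIN's exact design)."""

    def __init__(self, dim: int, num_inducing: int = 32, num_heads: int = 4):
        super().__init__()
        self.inducing = nn.Parameter(torch.randn(num_inducing, dim))
        self.attn_in = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_out = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, dim) -- the datapoints acting as the queried "training set"
        h = self.inducing.unsqueeze(0).expand(x.shape[0], -1, -1)  # (b, m, dim)
        h, _ = self.attn_in(h, x, x)     # inducing points summarize the data: O(n*m)
        out, _ = self.attn_out(x, h, h)  # datapoints read the summary back: O(n*m)
        return out


# usage: 10k datapoints attended through 32 inducing points
layer = InducingCrossAttention(dim=64)
z = layer(torch.randn(1, 10_000, 64))  # -> shape (1, 10000, 64)
```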
arXiv Detail & Related papers (2022-05-24T01:42:46Z)
- Inducing Gaussian Process Networks [80.40892394020797]
We propose inducing Gaussian process networks (IGN), a simple framework for simultaneously learning the feature space as well as the inducing points.
The inducing points, in particular, are learned directly in the feature space, enabling a seamless representation of complex structured domains.
We report on experimental results for real-world data sets showing that IGNs provide significant advances over state-of-the-art methods.
arXiv Detail & Related papers (2022-04-21T05:27:09Z)
- Deep Equilibrium Assisted Block Sparse Coding of Inter-dependent Signals: Application to Hyperspectral Imaging [71.57324258813675]
A dataset of inter-dependent signals is defined as a matrix whose columns demonstrate strong dependencies.
A neural network is employed to act as a structural prior and reveal the underlying signal interdependencies.
Deep unrolling and Deep equilibrium based algorithms are developed, forming highly interpretable and concise deep-learning-based architectures.
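As a generic illustration of the unrolling idea (plain ISTA for sparse coding, unrolled into a fixed number of learnable iterations; the paper's block-sparse and deep-equilibrium variants add a learned structural prior on top of this):

```python
import torch
import torch.nn as nn


def soft_threshold(x: torch.Tensor, lam: torch.Tensor) -> torch.Tensor:
    # proximal operator of lam * ||.||_1
    return torch.sign(x) * torch.clamp(x.abs() - lam, min=0.0)


class UnrolledISTA(nn.Module):
    """T iterations of ISTA for min_z 0.5*||y - z D^T||^2 + lam*||z||_1,
    unrolled into a network with a learnable dictionary, step size and
    threshold (a generic deep-unrolling sketch, not the paper's model)."""

    def __init__(self, dict_init: torch.Tensor, num_iters: int = 10):
        super().__init__()
        self.D = nn.Parameter(dict_init.clone())   # (obs_dim, code_dim)
        self.step = nn.Parameter(torch.tensor(0.1))
        self.lam = nn.Parameter(torch.tensor(0.05))
        self.num_iters = num_iters

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # y: (batch, obs_dim) -> sparse codes z: (batch, code_dim)
        z = torch.zeros(y.shape[0], self.D.shape[1], device=y.device)
        for _ in range(self.num_iters):
            residual = y - z @ self.D.T
            z = soft_threshold(z + self.step * residual @ self.D, self.step * self.lam)
        return z
```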
arXiv Detail & Related papers (2022-03-29T21:00:39Z)
- Factorized Gaussian Process Variational Autoencoders [6.866104126509981]
Variational autoencoders often assume isotropic Gaussian priors and mean-field posteriors, and hence do not exploit structure in scenarios where we may expect similarity or consistency across latent variables.
We propose a more scalable extension of these models by leveraging the independence of the auxiliary features, which is present in many datasets.
arXiv Detail & Related papers (2020-11-14T10:24:10Z)
- Relative gradient optimization of the Jacobian term in unsupervised deep learning [9.385902422987677]
Learning expressive probabilistic models correctly describing the data is a ubiquitous problem in machine learning.
Deep density models have been widely used for this task, but their maximum likelihood based training requires estimating the log-determinant of the Jacobian.
We propose a new approach for exact training of such neural networks.
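A sketch of why exactness is attainable here, assuming the simplest case of a single invertible linear layer $y = W x$ (the paper handles full deep networks): the log-likelihood contains $\log\lvert\det W\rvert$, whose Euclidean gradient $W^{-\top}$ requires an $O(D^3)$ inversion at every step. A relative (multiplicative) gradient step right-multiplies the gradient by $W^{\top} W$, so the Jacobian term contributes

$$
\big(\nabla_W \log\lvert\det W\rvert\big)\, W^{\top} W \;=\; W^{-\top} W^{\top} W \;=\; W,
$$

i.e., only matrix products remain and the determinant term is handled exactly without any inversion.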
arXiv Detail & Related papers (2020-06-26T16:41:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.