KDEformer: Accelerating Transformers via Kernel Density Estimation
- URL: http://arxiv.org/abs/2302.02451v2
- Date: Thu, 29 Jun 2023 17:51:10 GMT
- Title: KDEformer: Accelerating Transformers via Kernel Density Estimation
- Authors: Amir Zandieh, Insu Han, Majid Daliri, Amin Karbasi
- Abstract summary: In this paper, we introduce a new approach that approximates the dot-product attention mechanism in sub-quadratic time, avoiding the quadratic cost of exact computation.
We show that our proposed approach outperforms other attention approximations in terms of accuracy, memory, and runtime.
For ImageNet classification with T2T-ViT, we show over $18\times$ speedup while the accuracy drop is less than $0.5\%$.
- Score: 30.860357184928407
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dot-product attention mechanism plays a crucial role in modern deep
architectures (e.g., Transformer) for sequence modeling; however, naïve exact
computation of this model incurs quadratic time and memory complexities in
sequence length, hindering the training of long-sequence models. Critical
bottlenecks are due to the computation of partition functions in the
denominator of softmax function as well as the multiplication of the softmax
matrix with the matrix of values. Our key observation is that the former can be
reduced to a variant of the kernel density estimation (KDE) problem, and an
efficient KDE solver can be further utilized to accelerate the latter via
subsampling-based fast matrix products. Our proposed KDEformer can approximate
the attention in sub-quadratic time with provable spectral norm bounds, while
all prior results merely provide entry-wise error bounds. Empirically, we
verify that KDEformer outperforms other attention approximations in terms of
accuracy, memory, and runtime on various pre-trained models. On BigGAN image
generation, we achieve better generative scores than the exact computation with
over $4\times$ speedup. For ImageNet classification with T2T-ViT, KDEformer
shows over $18\times$ speedup while the accuracy drop is less than $0.5\%$.
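To make the abstract's two bottlenecks concrete, the following is a minimal NumPy sketch of the general recipe it describes: estimate the softmax partition function (the denominator) from a subsample of keys, then reuse those sampled columns for a subsampling-based approximation of the softmax-value product. The uniform sampler, function names, and sample size `m` below are illustrative assumptions, not the paper's actual KDE-based algorithm.

```python
# Illustrative sketch only: uniform subsampling stands in for KDEformer's
# KDE-based sampler; none of these names come from the paper.
import numpy as np

def exact_attention(Q, K, V):
    """Quadratic-time softmax attention: row-normalize exp(Q K^T), then multiply by V."""
    A = np.exp(Q @ K.T)                  # (n_q, n_k) score matrix -- the O(n^2) bottleneck
    return (A @ V) / A.sum(axis=1, keepdims=True)

def subsampled_attention(Q, K, V, m, rng=None):
    """Approximate attention using m sampled keys/values (m << number of keys)."""
    rng = rng or np.random.default_rng(0)
    n = K.shape[0]
    idx = rng.choice(n, size=m, replace=False)
    S = np.exp(Q @ K[idx].T)             # scores against the sampled keys only
    denom = S.sum(axis=1) * (n / m)      # unbiased estimate of the partition function
    numer = (S * (n / m)) @ V[idx]       # subsampling-based fast matrix product
    return numer / denom[:, None]

# Quick check on random unit-norm queries/keys: the estimate tracks exact attention.
rng = np.random.default_rng(1)
Q = rng.normal(size=(256, 32)); Q /= np.linalg.norm(Q, axis=1, keepdims=True)
K = rng.normal(size=(1024, 32)); K /= np.linalg.norm(K, axis=1, keepdims=True)
V = rng.normal(size=(1024, 16))
err = np.linalg.norm(subsampled_attention(Q, K, V, m=256) - exact_attention(Q, K, V))
print(f"Frobenius error of the subsampled approximation: {err:.3f}")
```

In the paper itself, the sampling probabilities come from the KDE estimates (importance sampling), which is what yields the spectral-norm guarantee; the uniform sampler above only illustrates the shape of the computation.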
Related papers
- Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers [4.674454841332859]
Transformer-based models have emerged as one of the most widely used architectures for natural language processing.
These huge models are memory hungry and incur significant inference latency even on cutting edge AI-accelerators.
We propose LeanAttention, a scalable technique of computing self-attention for the token-generation phase.
arXiv Detail & Related papers (2024-05-17T00:52:39Z) - PolySketchFormer: Fast Transformers via Sketching Polynomial Kernels [23.99075223506133]
We show that polynomial attention of sufficiently high degree can effectively replace softmax without sacrificing model quality.
We present a block-based algorithm to apply causal masking efficiently.
We validate PolySketchFormerAttention empirically by training language models capable of handling long contexts.
arXiv Detail & Related papers (2023-10-02T21:39:04Z) - Decreasing the Computing Time of Bayesian Optimization using
Generalizable Memory Pruning [56.334116591082896]
Running BO on high-dimensional or massive data sets becomes intractable due to the time complexity of fitting its surrogate model.
We show a wrapper of memory pruning and bounded optimization capable of being used with any surrogate model and acquisition function.
All model implementations are run on the MIT Supercloud state-of-the-art computing hardware.
arXiv Detail & Related papers (2023-09-08T14:05:56Z) - Fast Private Kernel Density Estimation via Locality Sensitive
Quantization [10.227538355037554]
We study efficient mechanisms for differentially private kernel density estimation (DP-KDE)
We show how the kernel can privately be approximated in time linear in $d$, making it feasible for high-dimensional data.
arXiv Detail & Related papers (2023-07-04T18:48:04Z) - KrADagrad: Kronecker Approximation-Domination Gradient Preconditioned
Stochastic Optimization [69.47358238222586]
Second-order optimizers allow parameter update step size and direction to adapt to loss curvature.
Recently, Shampoo introduced a Kronecker factored preconditioner to reduce these requirements.
It takes inverse matrix roots of ill-conditioned matrices.
This requires 64-bit precision, imposing strong hardware constraints.
arXiv Detail & Related papers (2023-05-30T21:15:45Z) - Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks.
Existing methods are either theoretically flawed or empirically ineffective for visual recognition.
We propose a family of Softmax-Free Transformers (SOFT)
arXiv Detail & Related papers (2022-07-05T03:08:27Z) - CROM: Continuous Reduced-Order Modeling of PDEs Using Implicit Neural
Representations [5.551136447769071]
Excessive runtime of high-fidelity partial differential equation solvers makes them unsuitable for time-critical applications.
We propose to accelerate PDE solvers using reduced-order modeling (ROM)
Our approach builds a smooth, low-dimensional manifold of the continuous vector fields themselves, not their discretization.
arXiv Detail & Related papers (2022-06-06T13:27:21Z) - SOFT: Softmax-free Transformer with Linear Complexity [112.9754491864247]
Vision transformers (ViTs) have pushed the state-of-the-art for various visual recognition tasks by patch-wise image tokenization followed by self-attention.
Various attempts on approximating the self-attention with linear complexity have been made in Natural Language Processing.
We identify that their limitations are rooted in keeping the softmax self-attention during approximations.
For the first time, a softmax-free transformer or SOFT is proposed.
arXiv Detail & Related papers (2021-10-22T17:57:29Z) - Stable, Fast and Accurate: Kernelized Attention with Relative Positional
Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE)
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using Fast Fourier Transform (FFT); a sketch of this Toeplitz-times-vector primitive appears after this list.
arXiv Detail & Related papers (2021-06-23T17:51:26Z) - Kernel Identification Through Transformers [54.3795894579111]
Kernel selection plays a central role in determining the performance of Gaussian Process (GP) models.
This work addresses the challenge of constructing custom kernel functions for high-dimensional GP regression models.
We introduce a novel approach named KITT: Kernel Identification Through Transformers.
arXiv Detail & Related papers (2021-06-15T14:32:38Z)
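On the Toeplitz/FFT observation quoted above for the relative-positional-encoding entry, the underlying primitive is standard linear algebra: an $n \times n$ Toeplitz matrix embeds in a $2n \times 2n$ circulant matrix, so a Toeplitz-vector product costs $O(n \log n)$ via the FFT. The sketch below (NumPy, with illustrative names) shows only this primitive, not that paper's full kernelized-attention algorithm.

```python
# Toeplitz-times-vector in O(n log n) via circulant embedding and the FFT.
# Illustrative sketch only; not code from the cited paper.
import numpy as np

def toeplitz_matvec(col, row, x):
    """Multiply the Toeplitz matrix with first column `col` and first row `row` by x.

    row[0] is ignored: the diagonal entry is taken from col[0].
    """
    n = len(x)
    # First column of the 2n x 2n circulant embedding: [col, 0, row[n-1], ..., row[1]].
    c = np.concatenate([col, [0.0], row[:0:-1]])
    y = np.fft.ifft(np.fft.fft(c) * np.fft.fft(np.concatenate([x, np.zeros(n)])))
    return y[:n].real

# Check against the dense product on a small random Toeplitz matrix.
rng = np.random.default_rng(0)
col, row, x = rng.normal(size=8), rng.normal(size=8), rng.normal(size=8)
T = np.array([[col[i - j] if i >= j else row[j - i] for j in range(8)] for i in range(8)])
print(np.allclose(T @ x, toeplitz_matvec(col, row, x)))  # True
```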
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.