On The Computational Complexity of Self-Attention
- URL: http://arxiv.org/abs/2209.04881v1
- Date: Sun, 11 Sep 2022 14:38:10 GMT
- Title: On The Computational Complexity of Self-Attention
- Authors: Feyza Duman Keles, Pruthuvi Mahesakya Wijewardena, Chinmay Hegde
- Abstract summary: Modern transformers rely on the self-attention mechanism, whose time- and space-complexity is quadratic in the length of the input.
We prove that the time complexity of self-attention is necessarily quadratic in the input length, unless the Strong Exponential Time Hypothesis (SETH) is false.
As a complement to our lower bounds, we show that it is indeed possible to approximate dot-product self-attention using finite Taylor series in linear-time.
- Score: 22.3395465641384
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Transformer architectures have led to remarkable progress in many
state-of-the-art applications. However, despite their successes, modern
transformers rely on the self-attention mechanism, whose time- and
space-complexity is quadratic in the length of the input. Several approaches
have been proposed to speed up self-attention mechanisms to achieve
sub-quadratic running time; however, the large majority of these works are not
accompanied by rigorous error guarantees. In this work, we establish lower
bounds on the computational complexity of self-attention in a number of
scenarios. We prove that the time complexity of self-attention is necessarily
quadratic in the input length, unless the Strong Exponential Time Hypothesis
(SETH) is false. This argument holds even if the attention computation is
performed only approximately, and for a variety of attention mechanisms. As a
complement to our lower bounds, we show that it is indeed possible to
approximate dot-product self-attention using finite Taylor series in
linear-time, at the cost of having an exponential dependence on the polynomial
order.
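To make the trade-off in the abstract concrete, below is a minimal NumPy sketch (illustrative only, not the authors' code; names such as `taylor_features` are placeholders). It contrasts exact softmax self-attention, which costs O(n^2 d) time and O(n^2) memory for sequence length n and head dimension d, with a degree-p Taylor approximation of exp(q.k): each degree-m term is an inner product of flattened m-fold tensor powers, so attention factorizes and runs in time linear in n, while the feature width D = 1 + d + ... + d^p grows exponentially in p.

```python
import math
import numpy as np

def exact_attention(Q, K, V):
    # Standard dot-product softmax attention (1/sqrt(d) scaling and
    # numerical stabilization omitted): forms the full n x n matrix,
    # hence O(n^2 d) time and O(n^2) memory.
    S = np.exp(Q @ K.T)                                        # (n, n)
    return (S @ V) / S.sum(axis=1, keepdims=True)

def taylor_features(X, p):
    # Map each row x to phi(x) such that
    #   phi(q) @ phi(k) = sum_{m=0}^{p} (q @ k)^m / m!,
    # i.e. the degree-p Taylor polynomial of exp(q @ k).
    # The degree-m block is the flattened m-fold tensor power of x scaled
    # by 1/sqrt(m!), so the total width is D = 1 + d + d^2 + ... + d^p.
    n, d = X.shape
    blocks, cur = [np.ones((n, 1))], np.ones((n, 1))
    for m in range(1, p + 1):
        cur = (cur[:, :, None] * X[:, None, :]).reshape(n, -1)  # (n, d^m)
        blocks.append(cur / math.sqrt(math.factorial(m)))
    return np.concatenate(blocks, axis=1)                       # (n, D)

def taylor_attention(Q, K, V, p=2):
    # Linear-in-n approximation: associativity lets us compute
    # phi(Q) @ (phi(K)^T V) without ever forming an n x n matrix,
    # at O(n D d_v) cost with D exponential in p.
    phi_Q, phi_K = taylor_features(Q, p), taylor_features(K, p)
    num = phi_Q @ (phi_K.T @ V)                                 # (n, d_v)
    den = phi_Q @ phi_K.sum(axis=0)                             # (n,)
    return num / den[:, None]
```

For example, with d = 64 and p = 3 the feature width D is already about 2.7 x 10^5, which is the exponential dependence on the polynomial order described above; the truncated series is also only a faithful stand-in for exp when |q.k| is moderate.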
Related papers
- Tensor Attention Training: Provably Efficient Learning of Higher-order Transformers [18.331374727331077]
The time complexity of tensor attention is a significant obstacle to its use in transformers.
We prove that the backward gradient of attention training can be computed in almost linear time.
arXiv Detail & Related papers (2024-05-26T02:59:13Z) - An Analysis of On-the-fly Determinization of Finite-state Automata [65.268245109828]
We establish an abstraction of on-the-fly determinization of finite-state automata and demonstrate how it can be applied to bound the size of the resulting automata.
A special case of our findings is that automata with many non-deterministic transitions almost always admit a determinization of polynomial complexity.
arXiv Detail & Related papers (2023-08-27T11:51:27Z) - Oracle separations of hybrid quantum-classical circuits [68.96380145211093]
Two models of quantum computation are considered: CQ_d and QC_d.
CQ_d captures the scenario of invoking a d-depth quantum computer many times; QC_d is more analogous to measurement-based quantum computation.
We show that, despite the similarities between CQ_d and QC_d, the two models are intrinsically different, i.e. CQ_d $\nsubseteq$ QC_d and QC_d $\nsubseteq$ CQ_d relative to an oracle.
arXiv Detail & Related papers (2022-01-06T03:10:53Z) - Fragmented imaginary-time evolution for early-stage quantum signal
processors [0.0]
Simulating quantum imaginary-time evolution (QITE) is a major promise of quantum computation.
Our main contribution is a new generation of deterministic, high-precision QITE algorithms.
We present two QITE-circuit sub-routines with excellent complexity scaling.
arXiv Detail & Related papers (2021-10-25T18:02:24Z) - SOFT: Softmax-free Transformer with Linear Complexity [112.9754491864247]
Vision transformers (ViTs) have pushed the state-of-the-art for various visual recognition tasks by patch-wise image tokenization followed by self-attention.
Various attempts to approximate self-attention with linear complexity have been made in Natural Language Processing.
We identify that their limitations are rooted in keeping the softmax self-attention during approximations.
For the first time, a softmax-free transformer (SOFT) is proposed; a generic kernelized linear-attention sketch in this spirit appears after this list.
arXiv Detail & Related papers (2021-10-22T17:57:29Z) - Combiner: Full Attention Transformer with Sparse Computation Cost [142.10203598824964]
We propose Combiner, which provides full attention capability in each attention head while maintaining low computational complexity.
We show that most sparse attention patterns used in existing sparse transformers can inspire the design of such a factorization for full attention.
An experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach.
arXiv Detail & Related papers (2021-07-12T22:43:11Z) - Autoformer: Decomposition Transformers with Auto-Correlation for
Long-Term Series Forecasting [68.86835407617778]
Autoformer is a novel decomposition architecture with an Auto-Correlation mechanism.
In long-term forecasting, Autoformer yields state-of-the-art accuracy, with a relative improvement on six benchmarks.
arXiv Detail & Related papers (2021-06-24T13:43:43Z) - Model Checking Quantum Continuous-Time Markov Chains [11.182363315152399]
We initiate the model checking of quantum continuous-time Markov chains (QCTMCs).
Since a QCTMC is a real-time system, we specify its temporal properties with signal temporal logic (STL).
arXiv Detail & Related papers (2021-05-02T02:46:19Z) - Capturing Multi-Resolution Context by Dilated Self-Attention [58.69803243323346]
We propose a combination of restricted self-attention and a dilation mechanism, which we refer to as dilated self-attention.
The restricted self-attention allows attention to neighboring frames of the query at a high resolution, and the dilation mechanism summarizes distant information to allow attending to it with a lower resolution.
ASR results demonstrate substantial improvements over restricted self-attention alone, achieving results similar to full-sequence self-attention at a fraction of the computational cost.
arXiv Detail & Related papers (2021-04-07T02:04:18Z) - Revisiting Linformer with a modified self-attention with linear
complexity [0.0]
I propose an alternative method for self-attention with linear complexity in time and space.
Since this method works for long sequences, it can be applied to images as well as audio.
arXiv Detail & Related papers (2020-12-16T13:23:29Z) - Open quantum systems beyond Fermi's golden rule: Diagrammatic expansion
of the steady-state time-convolutionless master equation [0.0]
We develop a diagrammatic approach to evaluate the steady-state TCL generator based on operators rather than superoperators.
We benchmark our method on a single non-interacting level coupled to Fermi reservoirs, where we recover the exact expansion to next-to-leading order.
arXiv Detail & Related papers (2020-10-19T20:27:31Z)
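Several of the linear-complexity approaches listed above (e.g., SOFT and the modified Linformer) avoid the exact softmax so that attention factorizes. The following is a minimal, generic kernelized linear-attention sketch in that spirit, not the SOFT architecture itself; the elu-plus-one feature map is an assumption borrowed from "linear transformer"-style methods, and all names are illustrative.

```python
import numpy as np

def elu1(X):
    # Non-negative feature map phi(x) = elu(x) + 1 (an assumed choice;
    # any non-negative map that keeps similarities positive would do).
    return np.maximum(X, 0.0) + np.exp(np.minimum(X, 0.0))

def linear_attention(Q, K, V):
    # Softmax-free attention: the similarity phi(q) @ phi(k) replaces
    # exp(q @ k), so the n x n attention matrix is never materialized.
    phi_Q, phi_K = elu1(Q), elu1(K)         # (n, d)
    KV = phi_K.T @ V                        # (d, d_v): O(n d d_v)
    Z = phi_Q @ phi_K.sum(axis=0)           # (n,): row-wise normalizer
    return (phi_Q @ KV) / Z[:, None]        # (n, d_v): O(n d d_v)
```

For sequence length n and head dimension d this costs O(n d^2) time and O(n d + d^2) memory, versus O(n^2 d) and O(n^2) for softmax attention; as the SOFT summary above notes, the open question for such methods is how well the chosen kernel substitutes for the softmax.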
This list is automatically generated from the titles and abstracts of the papers on this site.