On The Computational Complexity of Self-Attention
- URL: http://arxiv.org/abs/2209.04881v1
- Date: Sun, 11 Sep 2022 14:38:10 GMT
- Title: On The Computational Complexity of Self-Attention
- Authors: Feyza Duman Keles, Pruthuvi Mahesakya Wijewardena, Chinmay Hegde
- Abstract summary: Modern transformers rely on the self-attention mechanism, whose time- and space-complexity is quadratic in the length of the input.
We prove that the time complexity of self-attention is necessarily quadratic in the input length, unless the Strong Exponential Time Hypothesis (SETH) is false.
As a complement to our lower bounds, we show that it is indeed possible to approximate dot-product self-attention using finite Taylor series in linear-time.
- Score: 22.3395465641384
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Transformer architectures have led to remarkable progress in many
state-of-the-art applications. However, despite their successes, modern
transformers rely on the self-attention mechanism, whose time- and
space-complexity is quadratic in the length of the input. Several approaches
have been proposed to speed up self-attention mechanisms to achieve
sub-quadratic running time; however, the large majority of these works are not
accompanied by rigorous error guarantees. In this work, we establish lower
bounds on the computational complexity of self-attention in a number of
scenarios. We prove that the time complexity of self-attention is necessarily
quadratic in the input length, unless the Strong Exponential Time Hypothesis
(SETH) is false. This argument holds even if the attention computation is
performed only approximately, and for a variety of attention mechanisms. As a
complement to our lower bounds, we show that it is indeed possible to
approximate dot-product self-attention using finite Taylor series in
linear-time, at the cost of having an exponential dependence on the polynomial
order.
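The quadratic baseline and the Taylor-series idea can be sketched as follows. This is an illustrative first-order (degree-1) truncation of exp(x) ≈ 1 + x, not the paper's general finite-Taylor-series construction, and the function names are my own:

```python
import numpy as np

def attention_quadratic(Q, K, V):
    # Standard softmax attention: O(n^2) time and space in sequence length n.
    S = Q @ K.T / np.sqrt(Q.shape[1])
    A = np.exp(S - S.max(axis=1, keepdims=True))  # stabilized softmax
    A /= A.sum(axis=1, keepdims=True)
    return A @ V

def attention_taylor_linear(Q, K, V):
    # First-order Taylor sketch: exp(q.k) ~ 1 + q.k, giving O(n d^2) time.
    # Note: a degree-1 truncation can produce negative weights; the paper's
    # guarantees concern higher-order finite Taylor series.
    d = Q.shape[1]
    Qs, Ks = Q / d ** 0.25, K / d ** 0.25  # split the 1/sqrt(d) scaling
    # numerator_i = sum_j (1 + q_i.k_j) v_j = sum_j v_j + q_i @ (K^T V)
    num = V.sum(axis=0, keepdims=True) + Qs @ (Ks.T @ V)
    den = K.shape[0] + Qs @ Ks.sum(axis=0, keepdims=True).T
    return num / den
```

The key point is associativity: computing `Qs @ (Ks.T @ V)` avoids ever materializing the n-by-n attention matrix, at the cost (for higher polynomial orders) of feature dimensions that grow exponentially in the order.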
Related papers
- Tensor Attention Training: Provably Efficient Learning of Higher-order Transformers [29.27113653850964]
We prove that the backward gradient of tensor attention training can be computed in almost linear $n^{1+o(1)}$ time.
Our results establish the feasibility of efficient higher-order transformer training and may facilitate practical applications of tensor attention architectures.
arXiv Detail & Related papers (2024-05-26T02:59:13Z) - Taming Quantum Time Complexity [50.10645865330582]
We show how to achieve both exactness and thriftiness in the setting of time complexity.
We employ a novel approach to the design of quantum algorithms based on what we call transducers.
arXiv Detail & Related papers (2023-11-27T14:45:19Z) - An Analysis of On-the-fly Determinization of Finite-state Automata [65.268245109828]
We establish an abstraction of on-the-fly determinization of finite-state automata and demonstrate how it can be applied to bound the size of the resulting automata.
A special case of our findings is that automata with many non-deterministic transitions almost always admit a determinization of complexity.
arXiv Detail & Related papers (2023-08-27T11:51:27Z) - Fragmented imaginary-time evolution for early-stage quantum signal
processors [0.0]
Simulating quantum imaginary-time evolution (QITE) is a major promise of quantum computation.
Our main contribution is a new generation of deterministic, high-precision QITE algorithms.
We present two QITE-circuit sub-routines with excellent complexity scaling.
arXiv Detail & Related papers (2021-10-25T18:02:24Z) - SOFT: Softmax-free Transformer with Linear Complexity [112.9754491864247]
Vision transformers (ViTs) have pushed the state-of-the-art for various visual recognition tasks by patch-wise image tokenization followed by self-attention.
Various attempts at approximating self-attention with linear complexity have been made in natural language processing.
We identify that their limitations are rooted in keeping the softmax self-attention during approximations.
For the first time, a softmax-free transformer (SOFT) is proposed.
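SOFT itself replaces softmax with a Gaussian kernel and a low-rank decomposition; the sketch below shows the general softmax-free, kernelized attention pattern such methods belong to, using an elu+1 feature map rather than SOFT's exact construction:

```python
import numpy as np

def elu_feature_map(x):
    # phi(x) = elu(x) + 1: a positive feature map that replaces exp in softmax.
    return np.where(x > 0, x + 1.0, np.exp(x))

def softmax_free_attention(Q, K, V):
    # Kernelized attention: phi(Q) @ (phi(K)^T @ V) costs O(n d^2),
    # never forming the n-by-n similarity matrix.
    Qf, Kf = elu_feature_map(Q), elu_feature_map(K)
    num = Qf @ (Kf.T @ V)
    den = Qf @ Kf.sum(axis=0, keepdims=True).T  # row-wise normalizer
    return num / den
```

Because the feature map is positive, the implied attention weights are nonnegative and each row still normalizes to one, which is what makes dropping softmax workable.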
arXiv Detail & Related papers (2021-10-22T17:57:29Z) - Combiner: Full Attention Transformer with Sparse Computation Cost [142.10203598824964]
We propose Combiner, which provides full attention capability in each attention head while maintaining low computation complexity.
We show that most sparse attention patterns used in existing sparse transformers are able to inspire the design of such factorization for full attention.
An experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach.
arXiv Detail & Related papers (2021-07-12T22:43:11Z) - Autoformer: Decomposition Transformers with Auto-Correlation for
Long-Term Series Forecasting [68.86835407617778]
Autoformer is a novel decomposition architecture with an Auto-Correlation mechanism.
In long-term forecasting, Autoformer yields state-of-the-art accuracy, with a relative improvement on six benchmarks.
arXiv Detail & Related papers (2021-06-24T13:43:43Z) - Model Checking Quantum Continuous-Time Markov Chains [11.182363315152399]
We initiate the model checking of quantum continuous-time Markov chains (QCTMC)
As a real-time system, we specify the temporal properties on QCTMC by signal temporal logic (STL)
arXiv Detail & Related papers (2021-05-02T02:46:19Z) - Revisiting Linformer with a modified self-attention with linear
complexity [0.0]
I propose an alternative method for self-attention with linear complexity in time and space.
Since this method works for long sequences, it can be applied to images as well as audio.
arXiv Detail & Related papers (2020-12-16T13:23:29Z) - Open quantum systems beyond Fermi's golden rule: Diagrammatic expansion
of the steady-state time-convolutionless master equation [0.0]
We develop a diagrammatic approach to evaluate the steady-state TCL generator based on operators rather than superoperators.
We benchmark our method on a single non-interacting level coupled to Fermi reservoirs, where we recover the exact expansion to next-to-leading order.
arXiv Detail & Related papers (2020-10-19T20:27:31Z) - Estimating the entropy of shallow circuit outputs is hard [77.34726150561087]
The decision-problem version of estimating Shannon entropy is the Entropy Difference problem (ED); the analogous problem for quantum circuits is QED.
We show that relative to an oracle, these problems cannot be as hard as their counterparts with exponentially larger circuits.
arXiv Detail & Related papers (2020-02-27T15:32:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.