$O(n)$ Connections are Expressive Enough: Universal Approximability of
Sparse Transformers
- URL: http://arxiv.org/abs/2006.04862v2
- Date: Sat, 19 Dec 2020 07:16:15 GMT
- Title: $O(n)$ Connections are Expressive Enough: Universal Approximability of
Sparse Transformers
- Authors: Chulhee Yun, Yin-Wen Chang, Srinadh Bhojanapalli, Ankit Singh Rawat,
Sashank J. Reddi, Sanjiv Kumar
- Abstract summary: We show that sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n^2$ connections.
We also present experiments comparing different patterns/levels of sparsity on standard NLP tasks.
- Score: 71.31712741938837
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, Transformer networks have redefined the state of the art in many
NLP tasks. However, these models suffer from quadratic computational cost in
the input sequence length $n$ to compute pairwise attention in each layer. This
has prompted recent research into sparse Transformers that sparsify the
connections in the attention layers. While empirically promising for long
sequences, fundamental questions remain unanswered: Can sparse Transformers
approximate any arbitrary sequence-to-sequence function, similar to their dense
counterparts? How does the sparsity pattern and the sparsity level affect their
performance? In this paper, we address these questions and provide a unifying
framework that captures existing sparse attention models. We propose sufficient
conditions under which we prove that a sparse attention model can universally
approximate any sequence-to-sequence function. Surprisingly, our results show
that sparse Transformers with only $O(n)$ connections per attention layer can
approximate the same function class as the dense model with $n^2$ connections.
Lastly, we present experiments comparing different patterns/levels of sparsity
on standard NLP tasks.
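The abstract's $O(n)$-connection claim can be illustrated with a toy sparse-attention sketch. The sliding-window pattern below is a hypothetical example of one sparsity pattern of the kind the paper's unifying framework covers, not the authors' exact construction:

```python
import math

def sliding_window_attention(q, k, v, w):
    """Single-head attention where query i attends only to keys within
    +/- w positions, so each layer has O(n * w) = O(n) connections for
    a fixed window w, versus O(n^2) for dense attention.

    Toy sketch: sequences are lists of d-dimensional vectors
    (lists of floats), no learned projections or multiple heads.
    """
    n, d = len(q), len(q[0])
    out = []
    for i in range(n):
        lo, hi = max(0, i - w), min(n, i + w + 1)
        # scaled dot-product scores against keys inside the window only
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
                  for j in range(lo, hi)]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]  # numerically stable softmax
        z = sum(exps)
        weights = [e / z for e in exps]
        # weighted sum of the values inside the window
        out.append([sum(wt * v[j][t] for wt, j in zip(weights, range(lo, hi)))
                    for t in range(d)])
    return out
```

With `w` fixed, each of the `n` queries touches at most `2w + 1` keys, which is the linear-in-`n` connection budget the paper analyzes; the theoretical result says such patterns can, under the paper's sufficient conditions, still approximate any continuous sequence-to-sequence function.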
Related papers
- Mixture-of-Depths: Dynamically allocating compute in transformer-based language models [8.774705201394916]
Transformer-based language models spread FLOPs uniformly across input sequences.
We show that transformers can learn to dynamically allocate FLOPs to specific positions in a sequence.
arXiv Detail & Related papers (2024-04-02T19:28:11Z)
- Most Likely Sequence Generation for $n$-Grams, Transformers, HMMs, and Markov Chains, by Using Rollout Algorithms [3.014160809522789]
We consider a transformer with a $n$-gram structure, such as the one underlying ChatGPT.
The transformer provides next word probabilities, which can be used to generate word sequences.
We consider methods for computing word sequences that are highly likely, based on these probabilities.
arXiv Detail & Related papers (2024-03-19T19:58:46Z)
- Chain of Thought Empowers Transformers to Solve Inherently Serial Problems [57.58801785642868]
Chain of thought (CoT) is a highly effective method to improve the accuracy of large language models (LLMs) on arithmetics and symbolic reasoning tasks.
This work provides a theoretical understanding of the power of CoT for decoder-only transformers through the lens of expressiveness.
arXiv Detail & Related papers (2024-02-20T10:11:03Z)
- Fast Multipole Attention: A Divide-and-Conquer Attention Mechanism for Long Sequences [1.7403133838762448]
We present Fast Multipole Attention, a new attention mechanism that uses a divide-and-conquer strategy to reduce the time and memory complexity of attention for sequences of length $n$.
The hierarchical approach groups queries, keys, and values into $\mathcal{O}(\log n)$ levels of resolution, where groups at greater distances are larger in size and the weights to compute group quantities are learned.
We find empirically that the Fast Multipole Transformer performs much better than other efficient transformers in terms of memory size and accuracy.
arXiv Detail & Related papers (2023-10-18T13:40:41Z)
- Sampled Transformer for Point Sets [80.66097006145999]
A sparse transformer can reduce the computational complexity of the self-attention layers to $O(n)$, whilst still being a universal approximator of continuous sequence-to-sequence functions.
We propose an $O(n)$ complexity sampled transformer that can process point set elements directly without any additional inductive bias.
arXiv Detail & Related papers (2023-02-28T06:38:05Z)
- Transformers Learn Shortcuts to Automata [52.015990420075944]
We find that a low-depth Transformer can represent the computations of any finite-state automaton.
We show that a Transformer with $O(\log T)$ layers can exactly replicate the computation of an automaton on an input sequence of length $T$.
We further investigate the brittleness of these solutions and propose potential mitigations.
arXiv Detail & Related papers (2022-10-19T17:45:48Z)
- What Dense Graph Do You Need for Self-Attention? [73.82686008622596]
We present Hypercube Transformer, a sparse Transformer that models token interactions in a hypercube and shows comparable or even better results with vanilla Transformer.
Experiments on tasks requiring various sequence lengths validate our graph function.
arXiv Detail & Related papers (2022-05-27T14:36:55Z)
- Combiner: Full Attention Transformer with Sparse Computation Cost [142.10203598824964]
We propose Combiner, which provides full attention capability in each attention head while maintaining low computation complexity.
We show that most sparse attention patterns used in existing sparse transformers are able to inspire the design of such factorization for full attention.
An experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach.
arXiv Detail & Related papers (2021-07-12T22:43:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides (including all content above) and is not responsible for any consequences of its use.