$O(n)$ Connections are Expressive Enough: Universal Approximability of
Sparse Transformers
- URL: http://arxiv.org/abs/2006.04862v2
- Date: Sat, 19 Dec 2020 07:16:15 GMT
- Title: $O(n)$ Connections are Expressive Enough: Universal Approximability of
Sparse Transformers
- Authors: Chulhee Yun, Yin-Wen Chang, Srinadh Bhojanapalli, Ankit Singh Rawat,
Sashank J. Reddi, Sanjiv Kumar
- Abstract summary: We show that sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n^2$ connections.
We also present experiments comparing different patterns/levels of sparsity on standard NLP tasks.
- Score: 71.31712741938837
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, Transformer networks have redefined the state of the art in many
NLP tasks. However, these models suffer from quadratic computational cost in
the input sequence length $n$ to compute pairwise attention in each layer. This
has prompted recent research into sparse Transformers that sparsify the
connections in the attention layers. While empirically promising for long
sequences, fundamental questions remain unanswered: Can sparse Transformers
approximate any arbitrary sequence-to-sequence function, similar to their dense
counterparts? How do the sparsity pattern and the sparsity level affect their
performance? In this paper, we address these questions and provide a unifying
framework that captures existing sparse attention models. We propose sufficient
conditions under which we prove that a sparse attention model can universally
approximate any sequence-to-sequence function. Surprisingly, our results show
that sparse Transformers with only $O(n)$ connections per attention layer can
approximate the same function class as the dense model with $n^2$ connections.
Lastly, we present experiments comparing different patterns/levels of sparsity
on standard NLP tasks.
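To make the $O(n)$-connection idea concrete, below is a minimal sketch (not taken from the paper) of a sliding-window-plus-global sparsity pattern, one of the pattern families that a unifying framework like the one described above is meant to cover. The function names and defaults (`sparse_attention_mask`, `masked_attention`, `window`, `num_global`) are illustrative assumptions, not the authors' code.

```python
import numpy as np

def sparse_attention_mask(n, window=2, num_global=1):
    """Boolean mask; True marks an allowed query-key connection.

    Each position attends to a local window plus a few global tokens, so the
    number of connections grows as O(n) instead of n^2.
    (Illustrative pattern only; names and defaults are assumptions.)
    """
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask[i, lo:hi] = True          # local sliding window
    mask[:, :num_global] = True        # every query attends to the global tokens
    mask[:num_global, :] = True        # global tokens attend to every position
    return mask

def masked_attention(Q, K, V, mask):
    """Single-head scaled dot-product attention restricted to `mask`."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)           # cut disallowed connections
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over allowed keys only
    return weights @ V

n, d = 16, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
mask = sparse_attention_mask(n)
out = masked_attention(Q, K, V, mask)
print(f"connections used: {mask.sum()} / {n * n} dense")   # linear vs. quadratic growth
```

With a fixed window and a constant number of global tokens, each query connects to $O(1)$ keys, so the mask has $O(n)$ nonzero entries per layer rather than the $n^2$ of dense attention.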
Related papers
- On the Role of Depth and Looping for In-Context Learning with Task Diversity [69.4145579827826]
We study in-context learning for linear regression with diverse tasks.
We show that multilayer Transformers are not robust to even distributional shifts as small as $O(e^{-L})$ in Wasserstein distance.
arXiv Detail & Related papers (2024-10-29T03:27:56Z)
- Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers [54.20763128054692]
We study how a two-attention-layer transformer is trained to perform ICL on $n$-gram Markov chain data.
We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model.
arXiv Detail & Related papers (2024-09-09T18:10:26Z)
- Mixture-of-Depths: Dynamically allocating compute in transformer-based language models [8.774705201394916]
Transformer-based language models spread FLOPs uniformly across input sequences.
We show that transformers can learn to dynamically allocate FLOPs to specific positions in a sequence.
arXiv Detail & Related papers (2024-04-02T19:28:11Z)
- Most Likely Sequence Generation for $n$-Grams, Transformers, HMMs, and Markov Chains, by Using Rollout Algorithms [3.014160809522789]
We consider a transformer with an $n$-gram structure, such as the one underlying ChatGPT.
The transformer provides next word probabilities, which can be used to generate word sequences.
We consider methods for computing word sequences that are highly likely, based on these probabilities.
arXiv Detail & Related papers (2024-03-19T19:58:46Z)
- Fast Multipole Attention: A Divide-and-Conquer Attention Mechanism for Long Sequences [1.5484595752241124]
We present Fast Multipole Attention, a new attention mechanism that uses a divide-and-conquer strategy to reduce the time and memory complexity of attention for sequences of length $n$.
The hierarchical approach groups queries, keys, and values into $\mathcal{O}(\log n)$ levels of resolution, where groups at greater distances are larger in size and the weights to compute group quantities are learned.
We find empirically that the Fast Multipole Transformer performs much better than other efficient transformers in terms of memory size and accuracy.
arXiv Detail & Related papers (2023-10-18T13:40:41Z)
- Transformers Learn Shortcuts to Automata [52.015990420075944]
We find that a low-depth Transformer can represent the computations of any finite-state automaton.
We show that a Transformer with $O(\log T)$ layers can exactly replicate the computation of an automaton on an input sequence of length $T$.
We further investigate the brittleness of these solutions and propose potential mitigations.
arXiv Detail & Related papers (2022-10-19T17:45:48Z)
- What Dense Graph Do You Need for Self-Attention? [73.82686008622596]
We present Hypercube Transformer, a sparse Transformer that models token interactions in a hypercube and achieves comparable or even better results than the vanilla Transformer.
Experiments on tasks requiring various sequence lengths validate our graph function.
arXiv Detail & Related papers (2022-05-27T14:36:55Z)
- Combiner: Full Attention Transformer with Sparse Computation Cost [142.10203598824964]
We propose Combiner, which provides full attention capability in each attention head while maintaining low computation complexity.
We show that most sparse attention patterns used in existing sparse Transformers can inspire the design of such a factorization for full attention.
An experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach.
arXiv Detail & Related papers (2021-07-12T22:43:11Z)