On the Power of Saturated Transformers: A View from Circuit Complexity
- URL: http://arxiv.org/abs/2106.16213v1
- Date: Wed, 30 Jun 2021 17:09:47 GMT
- Title: On the Power of Saturated Transformers: A View from Circuit Complexity
- Authors: William Merrill and Yoav Goldberg and Roy Schwartz and Noah A. Smith
- Abstract summary: We show that saturated transformers transcend the limitations of hard-attention transformers.
The jump from hard to saturated attention can be understood as increasing the transformer's effective circuit depth by a factor of $O(log n)$.
- Score: 87.20342701232869
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers have become a standard architecture for many NLP problems. This
has motivated theoretically analyzing their capabilities as models of language,
in order to understand what makes them successful, and what their potential
weaknesses might be. Recent work has shown that transformers with hard
attention are quite limited in capacity, and in fact can be simulated by
constant-depth circuits. However, hard attention is a restrictive assumption,
which may complicate the relevance of these results for practical transformers.
In this work, we analyze the circuit complexity of transformers with saturated
attention: a generalization of hard attention that more closely captures the
attention patterns learnable in practical transformers. We show that saturated
transformers transcend the limitations of hard-attention transformers. With
some minor assumptions, we prove that the number of bits needed to represent a
saturated transformer memory vector is $O(\log n)$, which implies saturated
transformers can be simulated by log-depth circuits. Thus, the jump from hard
to saturated attention can be understood as increasing the transformer's
effective circuit depth by a factor of $O(\log n)$.
Related papers
- On the Role of Depth and Looping for In-Context Learning with Task Diversity [69.4145579827826]
We study in-context learning for linear regression with diverse tasks.
We show that multilayer Transformers are not robust to even distributional shifts as small as $O(e-L)$ in Wasserstein distance.
arXiv Detail & Related papers (2024-10-29T03:27:56Z) - Extracting Finite State Machines from Transformers [0.3069335774032178]
We investigate the trainability of transformers trained on regular languages from a mechanistic interpretability perspective.
We empirically find tighter lower bounds on the trainability of transformers, when a finite number of symbols determine the state.
Our mechanistic insight allows us to characterise the regular languages a one-layer transformer can learn with good length generalisation.
arXiv Detail & Related papers (2024-10-08T13:43:50Z) - Transformers are Expressive, But Are They Expressive Enough for Regression? [38.369337945109855]
We show that Transformers struggle to reliably approximate smooth functions, relying on piecewise constant approximations with sizable intervals.
By shedding light on these challenges, we advocate a refined understanding of Transformers' capabilities.
arXiv Detail & Related papers (2024-02-23T18:12:53Z) - On the Expressive Power of a Variant of the Looped Transformer [83.30272757948829]
We design a novel transformer block, dubbed AlgoFormer, to empower transformers with algorithmic capabilities.
The proposed AlgoFormer can achieve significantly higher in algorithm representation when using the same number of parameters.
Some theoretical and empirical results are presented to show that the designed transformer has the potential to be smarter than human-designed algorithms.
arXiv Detail & Related papers (2024-02-21T07:07:54Z) - The Expressive Power of Transformers with Chain of Thought [29.839710738657203]
In practice, transformers can be improved by allowing them to use a "chain of thought" or "scratchpad"
We show that the answer is yes, but the amount of increase depends crucially on the amount of intermediate generation.
Our results also imply that linear steps keep transformer decoders within context-sensitive languages.
arXiv Detail & Related papers (2023-10-11T22:35:18Z) - Linear attention is (maybe) all you need (to understand transformer
optimization) [55.81555204646486]
We make progress towards understanding the subtleties of training Transformers by studying a simple yet canonicalized shallow Transformer model.
Most importantly, we observe that our proposed linearized models can reproduce several prominent aspects of Transformer training dynamics.
arXiv Detail & Related papers (2023-10-02T10:48:42Z) - The Parallelism Tradeoff: Limitations of Log-Precision Transformers [29.716269397142973]
We prove that transformers whose arithmetic precision is logarithmic in the number of input tokens can be simulated by constant-depth logspace-uniform threshold circuits.
This provides insight on the power of transformers using known results in complexity theory.
arXiv Detail & Related papers (2022-07-02T03:49:34Z) - Thinking Like Transformers [64.96770952820691]
We propose a computational model for the transformer-encoder in the form of a programming language.
We show how RASP can be used to program solutions to tasks that could conceivably be learned by a Transformer.
We provide RASP programs for histograms, sorting, and Dyck-languages.
arXiv Detail & Related papers (2021-06-13T13:04:46Z) - Scalable Transformers for Neural Machine Translation [86.4530299266897]
Transformer has been widely adopted in Neural Machine Translation (NMT) because of its large capacity and parallel training of sequence generation.
We propose a novel scalable Transformers, which naturally contains sub-Transformers of different scales and have shared parameters.
A three-stage training scheme is proposed to tackle the difficulty of training the scalable Transformers.
arXiv Detail & Related papers (2021-06-04T04:04:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.