On the Computational Power of Transformers and its Implications in
Sequence Modeling
- URL: http://arxiv.org/abs/2006.09286v3
- Date: Sat, 10 Oct 2020 13:34:20 GMT
- Title: On the Computational Power of Transformers and its Implications in
Sequence Modeling
- Authors: Satwik Bhattamishra, Arkil Patel, Navin Goyal
- Abstract summary: In particular, the roles of various components in Transformers such as positional encodings, attention heads, residual connections, and feedforward networks, are not clear.
We provide an alternate and simpler proof to show that vanilla Transformers are Turing-complete.
We further analyze the necessity of each component for the Turing-completeness of the network; interestingly, we find that a particular type of residual connection is necessary.
- Score: 10.497742214344855
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers are being used extensively across several sequence modeling
tasks. Significant research effort has been devoted to experimentally probe the
inner workings of Transformers. However, our conceptual and theoretical
understanding of their power and inherent limitations is still nascent. In
particular, the roles of various components in Transformers such as positional
encodings, attention heads, residual connections, and feedforward networks, are
not clear. In this paper, we take a step towards answering these questions. We
analyze the computational power as captured by Turing-completeness. We first
provide an alternate and simpler proof to show that vanilla Transformers are
Turing-complete and then we prove that Transformers with only positional
masking and without any positional encoding are also Turing-complete. We
further analyze the necessity of each component for the Turing-completeness of
the network; interestingly, we find that a particular type of residual
connection is necessary. We demonstrate the practical implications of our
results via experiments on machine translation and synthetic tasks.
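To make the components under discussion concrete, the following is a minimal NumPy sketch of a single attention head used in two regimes: one with an explicit sinusoidal positional encoding and no mask, and one with no positional encoding where order information enters only through a causal positional mask, each followed by a residual connection. The dimensions, initialization, and encoding are illustrative assumptions; this is not the construction used in the paper's Turing-completeness proofs.
```python
# Toy single-head self-attention (NumPy) illustrating positional encodings,
# positional (causal) masking, and a residual connection. Illustrative only.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sinusoidal_encoding(n, d):
    # Standard sinusoidal positional encoding (Vaswani et al., 2017).
    pos = np.arange(n)[:, None]
    i = np.arange(d)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def attention_layer(x, Wq, Wk, Wv, causal_mask=False):
    # Scaled dot-product self-attention plus an additive residual connection.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    if causal_mask:
        # Positional masking: position i may only attend to positions <= i.
        keep = np.tril(np.ones_like(scores, dtype=bool))
        scores = np.where(keep, scores, -np.inf)
    out = softmax(scores) @ v
    return x + out  # residual connection

rng = np.random.default_rng(0)
n, d = 6, 8
x = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

# Variant 1: explicit positional encoding, no mask.
y_pe = attention_layer(x + sinusoidal_encoding(n, d), Wq, Wk, Wv)
# Variant 2: no positional encoding, order captured only via the causal mask.
y_mask = attention_layer(x, Wq, Wk, Wv, causal_mask=True)
print(y_pe.shape, y_mask.shape)  # both (6, 8)
```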
Related papers
- Transformers are Expressive, But Are They Expressive Enough for Regression? [38.369337945109855]
We show that Transformers struggle to reliably approximate smooth functions, relying on piecewise constant approximations with sizable intervals.
By shedding light on these challenges, we advocate a refined understanding of Transformers' capabilities.
arXiv Detail & Related papers (2024-02-23T18:12:53Z)
- On the Expressive Power of a Variant of the Looped Transformer [83.30272757948829]
We design a novel transformer block, dubbed AlgoFormer, to empower transformers with algorithmic capabilities.
The proposed AlgoFormer can achieve significantly higher expressiveness in algorithm representation when using the same number of parameters.
Some theoretical and empirical results are presented to show that the designed transformer has the potential to be smarter than human-designed algorithms.
arXiv Detail & Related papers (2024-02-21T07:07:54Z)
- Are Transformers with One Layer Self-Attention Using Low-Rank Weight Matrices Universal Approximators? [37.820617032391404]
We show that a single layer of self-attention with low-rank weight matrices possesses the capability to perfectly capture the context of an entire input sequence.
One-layer, single-head Transformers have a memorization capacity for finite samples, and Transformers consisting of one self-attention layer with two feed-forward neural networks are universal approximators for continuous permutation-equivariant functions on a compact domain.
arXiv Detail & Related papers (2023-07-26T08:07:37Z)
- Approximation and Estimation Ability of Transformers for Sequence-to-Sequence Functions with Infinite Dimensional Input [50.83356836818667]
We study the approximation and estimation ability of Transformers as sequence-to-sequence functions with infinite dimensional inputs.
Our theoretical results support the practical success of Transformers for high dimensional data.
arXiv Detail & Related papers (2023-05-30T02:44:49Z)
- How Powerful are Decoder-Only Transformer Neural Models? [0.0]
This is the first work to address the Turing completeness of the underlying technology employed in GPT-x.
We show that the sparsity/compressibility of the word embedding is an important consideration for Turing completeness to hold.
arXiv Detail & Related papers (2023-05-26T15:35:43Z)
- An Introduction to Transformers [23.915718146956355]
The transformer is a neural network component that can be used to learn useful representations of sequences or sets of data-points.
In this note we aim for a mathematically precise, intuitive, and clean description of the transformer architecture.
arXiv Detail & Related papers (2023-04-20T14:54:19Z)
- Your Transformer May Not be as Powerful as You Expect [88.11364619182773]
We mathematically analyze the power of RPE-based Transformers regarding whether the model is capable of approximating any continuous sequence-to-sequence functions.
We present a negative result by showing there exist continuous sequence-to-sequence functions that RPE-based Transformers cannot approximate no matter how deep and wide the neural network is.
We develop a novel attention module, called Universal RPE-based (URPE) Attention, which satisfies the conditions for universal approximation.
arXiv Detail & Related papers (2022-05-26T14:51:30Z)
- On the Power of Saturated Transformers: A View from Circuit Complexity [87.20342701232869]
We show that saturated transformers transcend the limitations of hard-attention transformers.
The jump from hard to saturated attention can be understood as increasing the transformer's effective circuit depth by a factor of $O(\log n)$; a toy contrast between the two attention regimes is sketched after this entry.
arXiv Detail & Related papers (2021-06-30T17:09:47Z)
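As promised above, here is a small NumPy sketch contrasting the two attention regimes: hard attention reads the value at a single maximizing position, while saturated attention averages uniformly over every position tied for the maximum score. This is a simplified reading of the standard definitions, not the paper's circuit-complexity construction.
```python
# Toy contrast between hard attention (one argmax position) and saturated
# attention (uniform average over all tied argmax positions). Illustrative only.
import numpy as np

def hard_attention(scores, values):
    # Attend to a single maximizing position (ties broken by argmax).
    return values[int(np.argmax(scores))]

def saturated_attention(scores, values, tol=1e-9):
    # Average uniformly over every position whose score ties the maximum.
    tied = np.flatnonzero(scores >= scores.max() - tol)
    return values[tied].mean(axis=0)

scores = np.array([2.0, 5.0, 5.0, 1.0])      # two positions tie for the max
values = np.array([[1.0], [10.0], [20.0], [3.0]])
print(hard_attention(scores, values))        # [10.] -> picks one position
print(saturated_attention(scores, values))   # [15.] -> averages the tied ones
```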
- Thinking Like Transformers [64.96770952820691]
We propose a computational model for the transformer-encoder in the form of a programming language.
We show how RASP can be used to program solutions to tasks that could conceivably be learned by a Transformer.
We provide RASP programs for histograms, sorting, and Dyck-languages; a toy analogue of the histogram program is sketched after this entry.
arXiv Detail & Related papers (2021-06-13T13:04:46Z)
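As a flavor of the histogram task mentioned above, the following NumPy sketch mimics the RASP select/aggregate pattern: each position selects all positions holding the same token and counts them. It is an analogue for illustration only, not actual RASP syntax.
```python
# NumPy analogue of a RASP-style histogram: for every position, count how
# many tokens in the input equal the token at that position.
import numpy as np

def histogram(tokens):
    tokens = np.asarray(tokens)
    # "Selector": boolean attention pattern S[i, j] = (tokens[i] == tokens[j]).
    selector = tokens[:, None] == tokens[None, :]
    # "Aggregate" a constant 1 over the selected positions, i.e. count them.
    return selector.sum(axis=1)

print(histogram(list("hello")))  # [1 1 2 2 1]
```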
- Scalable Transformers for Neural Machine Translation [86.4530299266897]
The Transformer has been widely adopted in Neural Machine Translation (NMT) because of its large capacity and parallel training of sequence generation.
We propose novel scalable Transformers, which naturally contain sub-Transformers of different scales with shared parameters.
A three-stage training scheme is proposed to tackle the difficulty of training the scalable Transformers.
arXiv Detail & Related papers (2021-06-04T04:04:10Z)