Practical Computational Power of Linear Transformers and Their Recurrent
and Self-Referential Extensions
- URL: http://arxiv.org/abs/2310.16076v1
- Date: Tue, 24 Oct 2023 17:17:01 GMT
- Title: Practical Computational Power of Linear Transformers and Their Recurrent
and Self-Referential Extensions
- Authors: Kazuki Irie, Róbert Csordás, Jürgen Schmidhuber
- Abstract summary: We study auto-regressive Transformers with linearised attention, a.k.a. linear Transformers (LTs) or Fast Weight Programmers (FWPs).
LTs are special in the sense that they are equivalent to RNN-like sequence processors with a fixed-size state, while they can also be expressed as the now-popular self-attention networks.
- Score: 15.793406740545024
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent studies of the computational power of recurrent neural networks (RNNs)
reveal a hierarchy of RNN architectures, given real-time and finite-precision
assumptions. Here we study auto-regressive Transformers with linearised
attention, a.k.a. linear Transformers (LTs) or Fast Weight Programmers (FWPs).
LTs are special in the sense that they are equivalent to RNN-like sequence
processors with a fixed-size state, while they can also be expressed as the
now-popular self-attention networks. We show that many well-known results for
the standard Transformer directly transfer to LTs/FWPs. Our formal language
recognition experiments demonstrate how recently proposed FWP extensions such
as recurrent FWPs and self-referential weight matrices successfully overcome
certain limitations of the LT, e.g., allowing for generalisation on the parity
problem. Our code is public.
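To make the stated equivalence concrete, the following minimal NumPy sketch (an illustration under the usual linearised-attention setup, not the authors' released code; the normalisation term typically used in practice is omitted) computes one causal linearised-attention head in two ways: as attention over the whole sequence, and as an RNN-like scan whose only state is a fixed-size d x d fast weight matrix. Both views produce the same outputs.

```python
import numpy as np

def phi(x):
    """Positive feature map for linearised attention (ELU + 1 is a common choice)."""
    return np.where(x > 0, x + 1.0, np.exp(x))

rng = np.random.default_rng(0)
T, d = 6, 4                       # sequence length, head dimension
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))

# 1) "Attention" view: causal linearised attention over the whole sequence.
A = phi(Q) @ phi(K).T             # unnormalised attention scores
A = np.tril(A)                    # causal mask: position t attends to s <= t
out_attention = A @ V

# 2) "Fast weight / RNN" view: a fixed-size state W (d x d outer-product memory)
#    is updated once per step and queried with the current query.
W = np.zeros((d, d))
out_recurrent = np.zeros((T, d))
for t in range(T):
    W += np.outer(phi(K[t]), V[t])    # write: rank-1 fast weight update
    out_recurrent[t] = phi(Q[t]) @ W  # read: query the fast weight matrix

assert np.allclose(out_attention, out_recurrent)
```

The scan view is what gives LTs/FWPs a fixed-size state per step, which is the property the formal language recognition experiments probe.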
Related papers
- Bypassing the Exponential Dependency: Looped Transformers Efficiently Learn In-context by Multi-step Gradient Descent [26.764893400499354]
We show that linear looped Transformers can implement multi-step gradient descent efficiently for in-context learning.
Our results demonstrate that as long as the input data has a constant condition number, e.g., $n = O(d)$, the linear looped Transformers can achieve a small error.
arXiv Detail & Related papers (2024-10-15T04:44:23Z) - Learning Linear Attention in Polynomial Time [115.68795790532289]
We provide the first results on learnability of single-layer Transformers with linear attention.
We show that linear attention may be viewed as a linear predictor in a suitably defined RKHS.
We show how to efficiently identify training datasets for which every empirical risk minimizer is equivalent to the linear Transformer.
arXiv Detail & Related papers (2024-10-14T02:41:01Z) - Attention as an RNN [66.5420926480473]
We show that attention can be viewed as a special Recurrent Neural Network (RNN) with the ability to compute its many-to-one RNN output efficiently.
We introduce a new efficient method of computing attention's many-to-many RNN output based on the parallel prefix scan algorithm.
We show Aarens achieve comparable performance to Transformers on 38 datasets spread across four popular sequential problem settings.
arXiv Detail & Related papers (2024-05-22T19:45:01Z) - Self-Supervised Pre-Training for Table Structure Recognition Transformer [25.04573593082671]
We propose a self-supervised pre-training (SSP) method for table structure recognition transformers.
We discover that the performance gap between the linear projection transformer and the hybrid CNN-transformer can be mitigated by SSP of the visual encoder in the TSR model.
arXiv Detail & Related papers (2024-02-23T19:34:06Z) - Deep Transformers without Shortcuts: Modifying Self-attention for
Faithful Signal Propagation [105.22961467028234]
Skip connections and normalisation layers are ubiquitous for the training of Deep Neural Networks (DNNs).
Recent approaches such as Deep Kernel Shaping have made progress towards reducing our reliance on them.
But these approaches are incompatible with the self-attention layers present in transformers.
arXiv Detail & Related papers (2023-02-20T21:26:25Z) - Your Transformer May Not be as Powerful as You Expect [88.11364619182773]
We mathematically analyze the power of RPE-based Transformers regarding whether the model is capable of approximating any continuous sequence-to-sequence functions.
We present a negative result by showing there exist continuous sequence-to-sequence functions that RPE-based Transformers cannot approximate no matter how deep and wide the neural network is.
We develop a novel attention module, called Universal RPE-based (URPE) Attention, which satisfies these conditions for universal approximation.
arXiv Detail & Related papers (2022-05-26T14:51:30Z) - Block-Recurrent Transformers [49.07682696216708]
We introduce the Block-Recurrent Transformer, which applies a transformer layer in a recurrent fashion along a sequence.
Our recurrent cell operates on blocks of tokens rather than single tokens, and leverages parallel computation within a block in order to make efficient use of accelerator hardware.
arXiv Detail & Related papers (2022-03-11T23:44:33Z) - Learning Bounded Context-Free-Grammar via LSTM and the
Transformer: Difference and Explanations [51.77000472945441]
Long Short-Term Memory (LSTM) and Transformers are two popular neural architectures used for natural language processing tasks.
In practice, it is often observed that Transformer models have better representation power than LSTM.
We study such practical differences between LSTM and Transformer and propose an explanation based on their latent space decomposition patterns.
arXiv Detail & Related papers (2021-12-16T19:56:44Z) - Going Beyond Linear Transformers with Recurrent Fast Weight Programmers [9.216201990315364]
We introduce recurrent Fast Weight Programmers (RFWPs).
We evaluate our novel recurrent FWPs (RFWPs) on two synthetic algorithmic tasks, Wikitext-103 language models, and on the Atari 2600 2D game environment.
In the reinforcement learning setting, we report large improvements over LSTM in several Atari games. (A schematic recurrent-FWP cell is sketched after this list.)
arXiv Detail & Related papers (2021-06-11T10:32:11Z) - The Power of Linear Recurrent Neural Networks [1.124958340749622]
We show how autoregressive linear, i.e., linearly activated recurrent neural networks (LRNNs) can approximate any time-dependent function f(t).
LRNNs outperform the previous state-of-the-art for the MSO task with a minimal number of units.
arXiv Detail & Related papers (2018-02-09T15:35:41Z)
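For the recurrent FWP extension referenced in the abstract and in the RFWP entry above, the sketch below is a schematic single cell (illustrative NumPy code with invented names such as RecurrentFWPCell, not the published architecture): unlike a plain linear Transformer layer, the key/value/query projections also see the layer's previous output, so the fast weight matrix is programmed as a function of the cell's own recurrent state.

```python
import numpy as np

def phi(x):
    """Positive feature map (ELU + 1)."""
    return np.where(x > 0, x + 1.0, np.exp(x))

class RecurrentFWPCell:
    """Schematic recurrent fast weight programmer cell (illustrative only).

    The slow weights act on the concatenation [x_t, y_{t-1}], so the
    generated keys, values and queries, and hence the fast weight matrix W,
    depend on the layer's own recurrent output.
    """

    def __init__(self, d, rng):
        self.Wk = rng.standard_normal((d, 2 * d)) / np.sqrt(2 * d)
        self.Wv = rng.standard_normal((d, 2 * d)) / np.sqrt(2 * d)
        self.Wq = rng.standard_normal((d, 2 * d)) / np.sqrt(2 * d)
        self.W = np.zeros((d, d))      # fast weights (fixed-size state)
        self.y = np.zeros(d)           # recurrent output fed back each step

    def step(self, x):
        z = np.concatenate([x, self.y])    # recurrence: feed back y_{t-1}
        k, v, q = self.Wk @ z, self.Wv @ z, self.Wq @ z
        self.W += np.outer(phi(k), v)      # program the fast weights
        self.y = phi(q) @ self.W           # read out with the current query
        return self.y

rng = np.random.default_rng(0)
cell = RecurrentFWPCell(d=4, rng=rng)
ys = [cell.step(x) for x in rng.standard_normal((6, 4))]
```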