Going Beyond Linear Transformers with Recurrent Fast Weight Programmers
- URL: http://arxiv.org/abs/2106.06295v1
- Date: Fri, 11 Jun 2021 10:32:11 GMT
- Title: Going Beyond Linear Transformers with Recurrent Fast Weight Programmers
- Authors: Kazuki Irie, Imanol Schlag, Róbert Csordás, Jürgen Schmidhuber
- Abstract summary: We introduce recurrent Fast Weight Programmers (RFWPs).
We evaluate RFWPs on two synthetic algorithmic tasks, Wikitext-103 language modelling, and the Atari 2600 2D game environment.
In the reinforcement learning setting, we report large improvements over LSTM in several Atari games.
- Score: 9.216201990315364
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers with linearised attention ("linear Transformers") have
demonstrated the practical scalability and effectiveness of outer product-based
Fast Weight Programmers (FWPs) from the '90s. However, the original FWP
formulation is more general than that of linear Transformers: a slow neural
network (NN) continually reprograms the weights of a fast NN with arbitrary NN
architectures. In existing linear Transformers, both NNs are feedforward and
consist of a single layer. Here we explore new variations by adding recurrence
to the slow and fast nets. We evaluate our novel recurrent FWPs (RFWPs) on two
synthetic algorithmic tasks (code execution and sequential ListOps),
Wikitext-103 language models, and on the Atari 2600 2D game environment. Our
models exhibit properties of Transformers and RNNs. In the reinforcement
learning setting, we report large improvements over LSTM in several Atari
games. Our code is public.
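To make the fast weight mechanism concrete, below is a minimal NumPy sketch of an outer-product Fast Weight Programmer whose slow net is made recurrent by feeding back the previous fast-net output. The layer sizes, the tanh slow net, the key normalisation, and the function names are illustrative assumptions, not the paper's exact RFWP variants (which also include, e.g., delta-rule updates and recurrent fast nets).

```python
# Illustrative sketch (not the paper's exact model): an outer-product Fast Weight
# Programmer with a recurrent slow net. The slow net emits (k, v, q) from the
# current input and the previous fast-net output; the fast weights are then
# reprogrammed with the outer product v k^T and queried with q.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_key = 8, 16                                   # illustrative sizes

W_x = rng.normal(scale=0.1, size=(3 * d_key, d_in))   # slow net: input projection
W_h = rng.normal(scale=0.1, size=(3 * d_key, d_key))  # slow net: feedback from y_{t-1}

def rfwp_step(x_t, y_prev, W_fast):
    """One time step of a recurrent slow net programming a one-layer fast net."""
    z = np.tanh(W_x @ x_t + W_h @ y_prev)             # recurrence via previous output
    k, v, q = np.split(z, 3)
    k = k / (np.abs(k).sum() + 1e-6)                  # simple key normalisation (assumption)
    W_fast = W_fast + np.outer(v, k)                  # "write": additive outer-product update
    y_t = W_fast @ q                                  # "read": fast net applied to the query
    return y_t, W_fast

W_fast = np.zeros((d_key, d_key))
y_t = np.zeros(d_key)
for x_t in rng.normal(size=(5, d_in)):                # toy sequence of length 5
    y_t, W_fast = rfwp_step(x_t, y_t, W_fast)
print(y_t.shape, W_fast.shape)                        # (16,) (16, 16)
```

With the recurrent terms removed (set W_h to zero and drop the tanh), the update reduces to the single-layer, purely feedforward case that corresponds to a linear Transformer.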
Related papers
- Parallelizing Linear Transformers with the Delta Rule over Sequence Length [49.88826673324244]
This work describes a hardware-efficient algorithm for training linear transformers with the delta rule.
We train a 1.3B model for 100B tokens and find that it outperforms recent linear-time baselines.
arXiv Detail & Related papers (2024-06-10T17:24:42Z)
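For reference, the sequential (non-parallel) form of the delta-rule fast weight update is sketched below; the paper above contributes a hardware-efficient algorithm that parallelises this recurrence over the sequence length, which the sketch does not reproduce. The beta gate and all shapes are illustrative assumptions.

```python
# Sequential delta-rule fast weight update: instead of purely additive writes,
# each step first reads the value currently stored under key k_t and corrects
# it toward the target value v_t with strength beta_t.
import numpy as np

def delta_rule_sequential(K, V, Q, beta):
    """K, V, Q: (T, d); beta: (T,) write strengths in [0, 1]. Returns (T, d)."""
    T, d = K.shape
    W = np.zeros((d, d))                                  # fast weight matrix
    out = np.zeros((T, d))
    for t in range(T):
        v_pred = W @ K[t]                                 # value W currently retrieves for k_t
        W = W + beta[t] * np.outer(V[t] - v_pred, K[t])   # delta rule: move toward target v_t
        out[t] = W @ Q[t]
    return out

rng = np.random.default_rng(1)
T, d = 6, 4
print(delta_rule_sequential(rng.normal(size=(T, d)), rng.normal(size=(T, d)),
                            rng.normal(size=(T, d)), rng.uniform(size=T)).shape)  # (6, 4)
```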
- Attention as an RNN [66.5420926480473]
We show that attention can be viewed as a special Recurrent Neural Network (RNN) with the ability to compute its many-to-one RNN output efficiently.
We introduce a new efficient method of computing attention's many-to-many RNN output based on the parallel prefix scan algorithm.
We show Aarens achieve comparable performance to Transformers on 38 datasets spread across four popular sequential problem settings.
arXiv Detail & Related papers (2024-05-22T19:45:01Z)
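To illustrate the many-to-one view, the sketch below computes causal softmax attention for a single query as an RNN that carries a running numerator, denominator, and max score. The prefix-scan parallelisation of the many-to-many case mentioned above is not shown, and the variable names are illustrative.

```python
# Softmax attention for one query, computed recurrently over the keys/values
# with a numerically stable running (max, numerator, denominator) state.
import numpy as np

def attention_as_rnn(q, K, V):
    """q: (d,); K, V: (T, d). Returns softmax(K @ q) @ V without storing all scores."""
    num = np.zeros(V.shape[1])   # running sum of exp(score_t) * v_t
    den = 0.0                    # running sum of exp(score_t)
    m = -np.inf                  # running max score, for numerical stability
    for k_t, v_t in zip(K, V):
        s = float(q @ k_t)
        m_new = max(m, s)
        scale = np.exp(m - m_new) if np.isfinite(m) else 0.0
        num = num * scale + np.exp(s - m_new) * v_t
        den = den * scale + np.exp(s - m_new)
        m = m_new
    return num / den

rng = np.random.default_rng(2)
K, V, q = rng.normal(size=(5, 4)), rng.normal(size=(5, 4)), rng.normal(size=4)
scores = K @ q
w = np.exp(scores - scores.max()); w = w / w.sum()
print(np.allclose(attention_as_rnn(q, K, V), w @ V))  # True
```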
- Practical Computational Power of Linear Transformers and Their Recurrent and Self-Referential Extensions [15.793406740545024]
We study auto-regressive Transformers with linearised attention, a.k.a. linear Transformers (LTs) or Fast Weight Programmers (FWPs).
LTs are special in the sense that they are equivalent to RNN-like sequence processors with a fixed-size state, while they can also be expressed as the now-popular self-attention networks.
arXiv Detail & Related papers (2023-10-24T17:17:01Z)
- RWKV: Reinventing RNNs for the Transformer Era [54.716108899349614]
We propose a novel model architecture that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers.
arXiv Detail & Related papers (2023-05-22T13:57:41Z)
- Your Transformer May Not be as Powerful as You Expect [88.11364619182773]
We mathematically analyze the power of RPE-based Transformers regarding whether the model is capable of approximating any continuous sequence-to-sequence functions.
We present a negative result by showing there exist continuous sequence-to-sequence functions that RPE-based Transformers cannot approximate no matter how deep and wide the neural network is.
We develop a novel attention module, called Universal RPE-based (URPE) Attention, which satisfies the conditions.
arXiv Detail & Related papers (2022-05-26T14:51:30Z)
- B-cos Networks: Alignment is All We Need for Interpretability [136.27303006772294]
We present a new direction for increasing the interpretability of deep neural networks (DNNs) by promoting weight-input alignment during training.
A B-cos transform induces a single linear transform that faithfully summarises the full model computations.
We show that it can easily be integrated into common models such as VGGs, ResNets, InceptionNets, and DenseNets.
arXiv Detail & Related papers (2022-05-20T16:03:29Z)
- Are Transformers More Robust? Towards Exact Robustness Verification for Transformers [3.2259574483835673]
We study the robustness of Transformers, a key characteristic, since low robustness may raise safety concerns.
Specifically, we focus on Sparsemax-based Transformers and reduce finding their maximum robustness to a Mixed Integer Quadratically Constrained Programming (MIQCP) problem.
We then conduct experiments on a Lane Departure Warning application to compare the robustness of Sparsemax-based Transformers against that of the more conventional Multi-Layer Perceptron (MLP) NNs.
arXiv Detail & Related papers (2022-02-08T15:27:33Z)
- FNet: Mixing Tokens with Fourier Transforms [0.578717214982749]
We show that Transformer encoder architectures can be massively sped up with limited accuracy costs.
We replace the self-attention sublayers with simple linear transformations that "mix" input tokens.
The resulting model, which we name FNet, scales very efficiently to long inputs.
arXiv Detail & Related papers (2021-05-09T03:32:48Z)
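To show what "mixing" tokens with Fourier transforms looks like in practice, here is a minimal sketch of an FNet-style mixing sublayer. The surrounding encoder block (layer norms, feed-forward sublayer) and all hyperparameters are omitted, and the shapes are illustrative.

```python
# Parameter-free token mixing in the spirit of FNet: replace the self-attention
# sublayer with a 2D Fourier transform over the (sequence, hidden) axes and
# keep only the real part.
import numpy as np

def fourier_token_mixing(x):
    """x: (seq_len, d_model) token embeddings -> mixed output of the same shape."""
    return np.real(np.fft.fft2(x))    # DFT along sequence and hidden dims, real part only

rng = np.random.default_rng(3)
x = rng.normal(size=(128, 64))        # toy sequence of 128 tokens, model width 64
print(fourier_token_mixing(x).shape)  # (128, 64)
```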
- Finetuning Pretrained Transformers into RNNs [81.72974646901136]
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation.
A linear-complexity recurrent variant has proven well suited for autoregressive generation.
This work aims to convert a pretrained transformer into its efficient recurrent counterpart.
arXiv Detail & Related papers (2021-03-24T10:50:43Z)
- Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention [22.228028613802174]
Transformers achieve remarkable performance in several tasks but due to their quadratic complexity, they are prohibitively slow for very long sequences.
We make use of the associativity property of matrix products to reduce the complexity from $\mathcal{O}(N^2)$ to $\mathcal{O}(N)$, where $N$ is the sequence length.
Our linear transformers achieve similar performance to vanilla transformers and they are up to 4000x faster on autoregressive prediction of very long sequences.
arXiv Detail & Related papers (2020-06-29T17:55:38Z)
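A minimal sketch of the associativity trick described in the entry above: causal linear attention computed with running sums instead of the full attention matrix. The feature map elu(x) + 1 follows that paper, but batching, scaling, and the parallel training formulation are left out, and the simple Python loop is for clarity only.

```python
# Causal linear attention in O(N): instead of materialising the N x N matrix
# softmax(Q K^T), each step reuses running sums of phi(k_j) v_j^T and phi(k_j).
import numpy as np

def phi(x):
    """elu(x) + 1: a positive feature map."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_linear_attention(Q, K, V):
    """Q, K: (T, d_k); V: (T, d_v). Returns (T, d_v) in time linear in T."""
    Qf, Kf = phi(Q), phi(K)
    S = np.zeros((Q.shape[1], V.shape[1]))   # running sum of phi(k_j) v_j^T
    z = np.zeros(Q.shape[1])                 # running sum of phi(k_j)
    out = np.zeros_like(V)
    for t in range(Q.shape[0]):
        S += np.outer(Kf[t], V[t])
        z += Kf[t]
        out[t] = (Qf[t] @ S) / (Qf[t] @ z + 1e-6)
    return out

rng = np.random.default_rng(4)
T, d = 10, 8
out = causal_linear_attention(rng.normal(size=(T, d)), rng.normal(size=(T, d)),
                              rng.normal(size=(T, d)))
print(out.shape)  # (10, 8)
```

The running state (S, z) is exactly the fixed-size fast-weight memory of the outer-product FWP view in the main abstract above.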