Gated recurrent neural networks discover attention
- URL: http://arxiv.org/abs/2309.01775v2
- Date: Wed, 7 Feb 2024 11:30:01 GMT
- Title: Gated recurrent neural networks discover attention
- Authors: Nicolas Zucchet, Seijin Kobayashi, Yassir Akram, Johannes von Oswald,
Maxime Larcher, Angelika Steger, João Sacramento
- Abstract summary: Recent architectural developments have enabled recurrent neural networks (RNNs) to reach and even surpass the performance of Transformers.
We show how RNNs equipped with linear recurrent layers interconnected by feedforward paths with multiplicative gating can implement self-attention.
Our findings highlight the importance of multiplicative interactions in neural networks and suggest that certain RNNs might be unexpectedly implementing attention under the hood.
- Score: 9.113450161370361
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent architectural developments have enabled recurrent neural networks
(RNNs) to reach and even surpass the performance of Transformers on certain
sequence modeling tasks. These modern RNNs feature a prominent design pattern:
linear recurrent layers interconnected by feedforward paths with multiplicative
gating. Here, we show how RNNs equipped with these two design elements can
exactly implement (linear) self-attention, the main building block of
Transformers. By reverse-engineering a set of trained RNNs, we find that
gradient descent in practice discovers our construction. In particular, we
examine RNNs trained to solve simple in-context learning tasks on which
Transformers are known to excel and find that gradient descent instills in our
RNNs the same attention-based in-context learning algorithm used by
Transformers. Our findings highlight the importance of multiplicative
interactions in neural networks and suggest that certain RNNs might be
unexpectedly implementing attention under the hood.
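To make the construction concrete, below is a minimal numerical sketch (not the paper's exact parameterisation or trained weights; the projection matrices W_q, W_k, W_v and all dimensions are illustrative) of how a linear recurrence combined with multiplicative gating can reproduce causal linear self-attention: an input gate forms the outer products v_t k_t^T, an identity linear recurrence accumulates them, and an output gate contracts the accumulated state with q_t.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_in, d_k, d_v = 6, 8, 4, 4          # illustrative sequence length and sizes

# Hypothetical projection matrices (stand-ins for learned weights).
W_q = rng.normal(size=(d_k, d_in))
W_k = rng.normal(size=(d_k, d_in))
W_v = rng.normal(size=(d_v, d_in))
x = rng.normal(size=(T, d_in))          # input sequence


def linear_attention(x):
    """Causal linear self-attention: y_t = (sum_{s<=t} v_s k_s^T) q_t."""
    q, k, v = x @ W_q.T, x @ W_k.T, x @ W_v.T
    S = np.zeros((d_v, d_k))            # running sum of outer products
    out = np.zeros((T, d_v))
    for t in range(T):
        S = S + np.outer(v[t], k[t])
        out[t] = S @ q[t]
    return out


def gated_linear_rnn(x, lam=1.0):
    """The same computation written as a gated linear RNN:
    - input gating: v_t k_t^T is an element-wise product of two linear
      projections of x_t (a multiplicative, GLU-style interaction),
    - linear recurrence: h_t = lam * h_{t-1} + flatten(v_t k_t^T),
    - output gating: multiply h_t element-wise by broadcast copies of q_t,
      then a linear readout sums over the key dimension.
    """
    h = np.zeros(d_v * d_k)             # flattened recurrent state
    out = np.zeros((T, d_v))
    for t in range(T):
        q, k, v = W_q @ x[t], W_k @ x[t], W_v @ x[t]
        gated_in = (v[:, None] * k[None, :]).ravel()      # multiplicative input gate
        h = lam * h + gated_in                            # linear recurrence
        gated_out = h * np.tile(q, d_v)                   # multiplicative output gate
        out[t] = gated_out.reshape(d_v, d_k).sum(axis=1)  # linear readout
    return out


# With lam = 1 the two computations agree exactly.
assert np.allclose(linear_attention(x), gated_linear_rnn(x))
```

Setting the recurrence factor lam below 1 gives a decayed variant rather than exact linear attention, closer to the regime in which modern linear recurrent layers typically operate.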
Related papers
- On the Design Space Between Transformers and Recursive Neural Nets [64.862738244735]
Continuous Recursive Neural Networks (CRvNN) and Neural Data Routers (NDR) are studied.
CRvNN pushes the boundaries of traditional RvNN, relaxing its discrete structure-wise composition and ending up with a Transformer-like structure.
NDR constrains the original Transformer to induce better structural inductive bias, ending up with a model that is close to CRvNN.
arXiv Detail & Related papers (2024-09-03T02:03:35Z) - Investigating Sparsity in Recurrent Neural Networks [0.0]
This thesis focuses on investigating the effects of pruning and Sparse Recurrent Neural Networks on the performance of RNNs.
We first describe the pruning of RNNs, its impact on the performance of RNNs, and the number of training epochs required to regain accuracy after the pruning is performed.
Next, we continue with the creation and training of Sparse Recurrent Neural Networks and identify the relation between the performance and the graph properties of its underlying arbitrary structure.
arXiv Detail & Related papers (2024-07-30T07:24:58Z) - Attention as an RNN [66.5420926480473]
We show that attention can be viewed as a special Recurrent Neural Network (RNN) with the ability to compute its many-to-one RNN output efficiently.
We introduce a new efficient method of computing attention's many-to-many RNN output based on the parallel prefix scan algorithm (a minimal sketch of this idea appears after this list).
We show that Aarens achieve comparable performance to Transformers on 38 datasets spread across four popular sequential problem settings.
arXiv Detail & Related papers (2024-05-22T19:45:01Z) - Use of Parallel Explanatory Models to Enhance Transparency of Neural Network Configurations for Cell Degradation Detection [18.214293024118145]
We build a parallel model to illuminate and understand the internal operation of neural networks.
We show how each layer of the RNN transforms the input distributions to increase detection accuracy.
At the same time we also discover a side effect acting to limit the improvement in accuracy.
arXiv Detail & Related papers (2024-04-17T12:22:54Z) - RNNs are not Transformers (Yet): The Key Bottleneck on In-context Retrieval [14.378613219812221]
We focus on understanding whether RNNs, known for their memory efficiency in handling long sequences, can match the performance of Transformers.
A key bottleneck lies in the inability of RNNs to perfectly retrieve information from the context, even with Chain-of-Thought (CoT) prompting.
We prove that adopting techniques to enhance the in-context retrieval capability of RNNs, including Retrieval-Augmented Generation (RAG) and adding a single Transformer layer, can elevate RNNs to be capable of solving all polynomial-time solvable problems with CoT.
arXiv Detail & Related papers (2024-02-28T17:38:06Z) - NAR-Former V2: Rethinking Transformer for Universal Neural Network Representation Learning [25.197394237526865]
We propose a modified Transformer-based universal neural network representation learning model, NAR-Former V2.
Specifically, we take the network as a graph and design a straightforward tokenizer to encode the network into a sequence.
We incorporate the inductive representation learning capability of GNNs into the Transformer, enabling it to generalize better when encountering unseen architectures.
arXiv Detail & Related papers (2023-06-19T09:11:04Z) - RWKV: Reinventing RNNs for the Transformer Era [54.716108899349614]
We propose a novel model architecture that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers.
arXiv Detail & Related papers (2023-05-22T13:57:41Z) - Learning Ability of Interpolating Deep Convolutional Neural Networks [28.437011792990347]
We study the learning ability of an important family of deep neural networks, deep convolutional neural networks (DCNNs).
We show that by adding well-defined layers to a non-interpolating DCNN, we can obtain some interpolating DCNNs that maintain the good learning rates of the non-interpolating DCNN.
Our work provides theoretical verification of how overfitted DCNNs generalize well.
arXiv Detail & Related papers (2022-10-25T17:22:31Z) - Data-driven emergence of convolutional structure in neural networks [83.4920717252233]
We show how fully-connected neural networks solving a discrimination task can learn a convolutional structure directly from their inputs.
By carefully designing data models, we show that the emergence of this pattern is triggered by the non-Gaussian, higher-order local structure of the inputs.
arXiv Detail & Related papers (2022-02-01T17:11:13Z) - Neuroevolution of a Recurrent Neural Network for Spatial and Working Memory in a Simulated Robotic Environment [57.91534223695695]
We evolved weights in a biologically plausible recurrent neural network (RNN) using an evolutionary algorithm to replicate the behavior and neural activity observed in rats.
Our method demonstrates how the dynamic activity in evolved RNNs can capture interesting and complex cognitive behavior.
arXiv Detail & Related papers (2021-02-25T02:13:52Z) - Progressive Tandem Learning for Pattern Recognition with Deep Spiking Neural Networks [80.15411508088522]
Spiking neural networks (SNNs) have shown advantages over traditional artificial neural networks (ANNs) for low latency and high computational efficiency.
We propose a novel ANN-to-SNN conversion and layer-wise learning framework for rapid and efficient pattern recognition.
arXiv Detail & Related papers (2020-07-02T15:38:44Z)
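As a companion to the "Attention as an RNN" entry above, here is a small illustrative sketch of the prefix-scan view of attention. It assumes a single query vector q shared across all positions (all names, dimensions, and data below are made up for the example): prefix attention outputs are obtained by folding an associative combine over (running max, rescaled numerator, denominator) triples, so the same combine could equally be evaluated by a parallel prefix scan; it is applied sequentially here for clarity.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 5, 3
q = rng.normal(size=d)        # single query shared across positions (assumption)
K = rng.normal(size=(T, d))   # keys
V = rng.normal(size=(T, d))   # values


def prefix_attention_direct():
    """Softmax attention of q over the first t+1 key/value pairs, for every t."""
    out = np.zeros((T, d))
    for t in range(T):
        s = K[:t + 1] @ q
        a = np.exp(s - s.max())
        out[t] = (a[:, None] * V[:t + 1]).sum(axis=0) / a.sum()
    return out


def combine(a, b):
    """Associative combine over (running max, rescaled numerator, denominator)."""
    m_a, u_a, w_a = a
    m_b, u_b, w_b = b
    m = max(m_a, m_b)
    return (m,
            u_a * np.exp(m_a - m) + u_b * np.exp(m_b - m),
            w_a * np.exp(m_a - m) + w_b * np.exp(m_b - m))


def prefix_attention_scan():
    """Same outputs obtained by folding `combine` over per-position leaves."""
    out = np.zeros((T, d))
    state = (-np.inf, np.zeros(d), 0.0)        # identity element of `combine`
    for t in range(T):
        leaf = (K[t] @ q, V[t], 1.0)           # contribution of pair (k_t, v_t)
        state = combine(state, leaf)
        out[t] = state[1] / state[2]
    return out


assert np.allclose(prefix_attention_direct(), prefix_attention_scan())
```

Because `combine` is associative, the T per-position leaves can be reduced in logarithmic parallel depth with a standard prefix-scan schedule, which is the source of the efficiency claim in the entry above.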