Recurrent Memory Transformer
- URL: http://arxiv.org/abs/2207.06881v1
- Date: Thu, 14 Jul 2022 13:00:22 GMT
- Title: Recurrent Memory Transformer
- Authors: Aydar Bulatov, Yuri Kuratov and Mikhail S. Burtsev
- Abstract summary: We study a memory-augmented segment-level recurrent Transformer (Recurrent Memory Transformer).
We implement a memory mechanism with no changes to the Transformer model by adding special memory tokens to the input or output sequence.
Our model performs on par with the Transformer-XL on language modeling for smaller memory sizes and outperforms it for tasks that require longer sequence processing.
- Score: 0.3529736140137003
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based models are effective across multiple domains and tasks. Self-attention combines information from all sequence elements into context-aware representations. However, global and local information must be stored mostly in the same element-wise representations. Moreover, the length of an input sequence is limited by the quadratic computational complexity of self-attention.
In this work, we propose and study a memory-augmented segment-level recurrent Transformer (Recurrent Memory Transformer). Memory allows the model to store and process local and global information and, via recurrence, to pass information between segments of a long sequence. We implement the memory mechanism with no changes to the Transformer model by adding special memory tokens to the input or output sequence. The Transformer is then trained to control both memory operations and the processing of sequence representations.
Experimental results show that our model performs on par with Transformer-XL on language modeling for smaller memory sizes and outperforms it on tasks that require longer sequence processing. We also show that adding memory tokens to Transformer-XL improves its performance. This makes the Recurrent Memory Transformer a promising architecture for applications that require learning long-term dependencies and general-purpose memory processing, such as algorithmic tasks and reasoning.
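As a concrete illustration, here is a minimal PyTorch sketch of this segment-level recurrence, assuming a plain bidirectional encoder. Names like `num_mem_tokens` and the use of a single prepended memory block are illustrative; the paper's decoder variant additionally splits memory into read and write blocks to respect causal masking.

```python
import torch
import torch.nn as nn

class RecurrentMemoryTransformerSketch(nn.Module):
    """Minimal sketch: an unmodified Transformer encoder processes a long
    sequence segment by segment; learned memory tokens are prepended to
    each segment and their updated states are carried to the next one."""

    def __init__(self, d_model=256, nhead=4, num_layers=2, num_mem_tokens=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        # Initial memory: learned embeddings, one row per memory token.
        self.mem_init = nn.Parameter(torch.randn(num_mem_tokens, d_model))
        self.num_mem_tokens = num_mem_tokens

    def forward(self, segments):
        """segments: list of (batch, segment_len, d_model) tensors."""
        batch = segments[0].shape[0]
        memory = self.mem_init.unsqueeze(0).expand(batch, -1, -1)
        outputs = []
        for seg in segments:
            # Memory tokens join the input; the Transformer itself is unchanged.
            hidden = self.encoder(torch.cat([memory, seg], dim=1))
            # Updated memory states are read off the memory positions and
            # passed to the next segment -- this is the recurrence.
            memory = hidden[:, :self.num_mem_tokens]
            outputs.append(hidden[:, self.num_mem_tokens:])
        return torch.cat(outputs, dim=1), memory
```

Because gradients flow through the memory tokens across segments, training amounts to backpropagation through time over a few segments, which is what lets the model learn to write useful information into memory.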
Related papers
- Blockwise Parallel Transformer for Large Context Models [70.97386897478238]
Blockwise Parallel Transformer (BPT) computes self-attention blockwise and fuses it with the feedforward network to minimize memory costs.
By processing longer input sequences while maintaining memory efficiency, BPT enables training on sequences up to 32 times longer than vanilla Transformers and up to 4 times longer than previous memory-efficient methods.
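The blockwise idea can be sketched in a few lines: chunk the queries so the full attention matrix is never materialized at once, and apply the feedforward network inside the same loop. This is a simplified illustration, assuming `ffn` is any position-wise module; BPT additionally tiles over keys and values with a running softmax, which this sketch omits.

```python
import torch

def blockwise_attention_ffn(q, k, v, ffn, block_size=512):
    """Sketch: process queries in blocks so the full (L x L) attention
    matrix is never materialized, fusing the feedforward network into
    the same loop. Peak memory is O(block_size * L) rather than O(L^2)."""
    scale = q.shape[-1] ** -0.5
    out_blocks = []
    for start in range(0, q.shape[1], block_size):
        qb = q[:, start:start + block_size]                   # (B, b, d)
        scores = torch.einsum('bqd,bkd->bqk', qb, k) * scale  # (B, b, L)
        attn = scores.softmax(dim=-1)
        ob = torch.einsum('bqk,bkd->bqd', attn, v)
        out_blocks.append(ffn(ob))  # FFN applied per block ("fusion")
    return torch.cat(out_blocks, dim=1)
```

For example, with `d = 64` a usable `ffn` is `torch.nn.Sequential(torch.nn.Linear(64, 256), torch.nn.GELU(), torch.nn.Linear(256, 64))`.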
arXiv Detail & Related papers (2023-05-30T19:25:51Z)
- Scaling Transformer to 1M tokens and beyond with RMT [5.60052250541419]
A major limitation for the broader scope of problems solvable by transformers is the quadratic scaling of computational complexity with input size.
In this study, we investigate the recurrent memory augmentation of pre-trained transformer models to extend input context length while linearly scaling compute.
Our approach demonstrates the capability to store information in memory for sequences of up to an unprecedented two million tokens while maintaining high retrieval accuracy.
arXiv Detail & Related papers (2023-04-19T16:18:54Z)
- Token Turing Machines [53.22971546637947]
Token Turing Machines (TTM) is a sequential, autoregressive Transformer model with memory for real-world sequential visual understanding.
Our model is inspired by the seminal Neural Turing Machine, and has an external memory consisting of a set of tokens which summarise the previous history.
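A hedged sketch of the resulting read-process-write loop, assuming a TokenLearner-style summarizer that pools a variable number of tokens into a fixed set via learned importance weights; all module names and sizes here are illustrative, not from the paper's code.

```python
import torch
import torch.nn as nn

class TokenSummarizer(nn.Module):
    """Pool a variable number of tokens into k summary tokens via learned
    importance weights (in the spirit of TTM's token summarization)."""
    def __init__(self, d_model, k):
        super().__init__()
        self.score = nn.Linear(d_model, k)  # one importance map per output slot

    def forward(self, tokens):                          # (B, n, d)
        w = self.score(tokens).softmax(dim=1)           # (B, n, k), normalized over n
        return torch.einsum('bnk,bnd->bkd', w, tokens)  # (B, k, d)

class TokenTuringMachineStep(nn.Module):
    """One read-process-write step with an external token memory."""
    def __init__(self, d_model=256, mem_tokens=16, read_tokens=8, nhead=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.processor = nn.TransformerEncoder(layer, num_layers=2)
        self.read = TokenSummarizer(d_model, read_tokens)
        self.write = TokenSummarizer(d_model, mem_tokens)

    def forward(self, memory, inputs):                  # (B, m, d), (B, n, d)
        r = self.read(torch.cat([memory, inputs], dim=1))  # read: summarize
        o = self.processor(r)                               # process
        new_memory = self.write(torch.cat([memory, inputs, o], dim=1))  # write
        return new_memory, o
```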
arXiv Detail & Related papers (2022-11-16T18:59:18Z)
- LaMemo: Language Modeling with Look-Ahead Memory [50.6248714811912]
We propose Look-Ahead Memory (LaMemo), which enhances recurrent memory by incrementally attending to right-side tokens.
LaMemo embraces bi-directional attention and segment recurrence with an additional overhead only linearly proportional to the memory length.
Experiments on widely used language modeling benchmarks demonstrate its superiority over the baselines equipped with different types of memory.
arXiv Detail & Related papers (2022-04-15T06:11:25Z)
- Linearizing Transformer with Key-Value Memory Bank [54.83663647680612]
We propose MemSizer, an approach that projects the source sequence into a lower-dimensional representation.
MemSizer not only achieves linear time complexity but also enjoys efficient recurrent-style autoregressive generation.
We demonstrate that MemSizer provides an improved tradeoff between efficiency and accuracy over the vanilla transformer.
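The general idea of a fixed-size key-value memory bank can be sketched as follows. This is a generic, inducing-point-style illustration of how a k-slot memory yields O(L*k) attention; MemSizer's exact parameterization differs, so treat this only as a sketch of the concept.

```python
import torch
import torch.nn as nn

class KeyValueMemoryBank(nn.Module):
    """Generic sketch of linear-time attention through a fixed-size
    key-value memory: compress the sequence into k slots, then let each
    position attend to the slots. Cost is O(L*k) rather than O(L^2)."""
    def __init__(self, d_model=256, k_slots=32):
        super().__init__()
        self.slot_queries = nn.Parameter(torch.randn(k_slots, d_model))
        self.q_proj = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, x):                              # (B, L, d)
        slots = self.slot_queries.unsqueeze(0)         # (1, k, d)
        # Compress: each slot pools the whole sequence -> (B, k, d)
        comp = torch.softmax(slots @ x.transpose(1, 2) * self.scale, -1) @ x
        # Attend: each position reads from the k memory slots.
        q = self.q_proj(x)                             # (B, L, d)
        attn = torch.softmax(q @ comp.transpose(1, 2) * self.scale, -1)
        return attn @ comp                             # (B, L, d)
```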
arXiv Detail & Related papers (2022-03-23T18:10:18Z)
- Mesa: A Memory-saving Training Framework for Transformers [58.78933015299703]
We present Mesa, a memory-saving training framework for Transformers.
Mesa uses exact activations during the forward pass while storing a low-precision version of the activations to reduce memory consumption during training.
Experiments on ImageNet, CIFAR-100, and ADE20K demonstrate that Mesa can cut the memory footprint of training in half.
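The core trick can be sketched with a custom autograd function: compute the forward pass exactly, but save only a reduced-precision copy of the activations for the backward pass. Mesa itself uses 8-bit quantization with learned statistics; the fp16 version below is a simplified stand-in, and the 2-D input assumption is for brevity.

```python
import torch

class LowPrecisionLinear(torch.autograd.Function):
    """Sketch: exact forward computation, but only a half-precision copy
    of the activations is kept for backward, halving activation memory.
    Assumes x is (batch, in_features) and weight is (out, in_features)."""

    @staticmethod
    def forward(ctx, x, weight):
        out = x @ weight.t()                      # exact forward pass
        ctx.save_for_backward(x.half(), weight)   # store low-precision copy
        return out

    @staticmethod
    def backward(ctx, grad_out):
        x_lp, weight = ctx.saved_tensors
        x = x_lp.float()                          # upcast for gradient math
        grad_x = grad_out @ weight                # (batch, in_features)
        grad_w = grad_out.t() @ x                 # (out, in_features)
        return grad_x, grad_w

# Usage: y = LowPrecisionLinear.apply(x, w)
```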
arXiv Detail & Related papers (2021-11-22T11:23:01Z)
- Memformer: A Memory-Augmented Transformer for Sequence Modeling [55.780849185884996]
We present Memformer, an efficient neural network for sequence modeling.
Our model achieves linear time complexity and constant memory space complexity when processing long sequences.
arXiv Detail & Related papers (2020-10-14T09:03:36Z)
- Memory Transformer [0.31406146587437894]
Transformer-based models have achieved state-of-the-art results in many natural language processing tasks.
Memory-augmented neural networks (MANNs) extend traditional neural architectures with general-purpose memory for representations.
We evaluate these memory-augmented Transformers and demonstrate that the presence of memory positively correlates with model performance.
arXiv Detail & Related papers (2020-06-20T09:06:27Z)