Memory Transformer
- URL: http://arxiv.org/abs/2006.11527v2
- Date: Tue, 16 Feb 2021 08:06:47 GMT
- Title: Memory Transformer
- Authors: Mikhail S. Burtsev, Yuri Kuratov, Anton Peganov, Grigory V. Sapunov
- Abstract summary: Transformer-based models have achieved state-of-the-art results in many natural language processing tasks.
Memory-augmented neural networks (MANNs) extend traditional neural architectures with general-purpose memory for representations.
We evaluate these memory-augmented Transformers and demonstrate that the presence of memory positively correlates with model performance.
- Score: 0.31406146587437894
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based models have achieved state-of-the-art results in many
natural language processing tasks. The self-attention architecture allows a transformer to combine information from all elements of a sequence into context-aware representations. However, information about the context is stored mostly in those same element-wise representations, which may make it harder to process properties of the sequence as a whole.
Adding trainable memory to selectively store local as well as global
representations of a sequence is a promising direction to improve the
Transformer model. Memory-augmented neural networks (MANNs) extend traditional
neural architectures with general-purpose memory for representations. MANNs
have demonstrated the capability to learn simple algorithms like Copy or
Reverse and can be successfully trained via backpropagation on diverse tasks
from question answering to language modeling, outperforming RNNs and LSTMs of comparable complexity. In this work, we propose and study a few extensions of the Transformer baseline: (1) adding memory tokens to store non-local representations, (2) creating a memory bottleneck for the global information, and (3) controlling memory updates with a dedicated layer. We evaluate these memory-augmented Transformers and demonstrate that the presence of memory positively correlates with model performance on machine translation and language modelling tasks. Augmenting a pre-trained masked language model with memory tokens shows mixed results on tasks from the GLUE benchmark. Visualization of attention patterns over the memory suggests that memory improves the model's ability to process global context.
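
Extension (1) amounts to prepending a small number of trainable vectors ("memory tokens") to the input sequence so that ordinary self-attention can read from and write to them. The sketch below, assuming PyTorch, illustrates that idea; the class and argument names (MemoryTokenEncoder, num_mem_tokens) are illustrative rather than taken from the paper's code, and the bottleneck and dedicated-update variants (2) and (3) are not shown.

```python
import torch
import torch.nn as nn

class MemoryTokenEncoder(nn.Module):
    """Sketch of a Transformer encoder with trainable memory tokens
    prepended to the input (illustrative, not the authors' code)."""

    def __init__(self, d_model=512, nhead=8, num_layers=6, num_mem_tokens=10):
        super().__init__()
        # Trainable [mem] vectors prepended to every input sequence.
        self.mem_tokens = nn.Parameter(torch.randn(num_mem_tokens, d_model) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, x):                        # x: (batch, seq_len, d_model)
        mem = self.mem_tokens.unsqueeze(0).expand(x.size(0), -1, -1)
        # Memory and sequence tokens attend to each other through ordinary
        # self-attention; no other change to the Transformer is required.
        h = self.encoder(torch.cat([mem, x], dim=1))
        mem_out, seq_out = h[:, :mem.size(1)], h[:, mem.size(1):]
        return seq_out, mem_out                  # mem_out carries global context

# Usage: encode a batch of 2 sequences of length 16.
enc = MemoryTokenEncoder()
seq_out, mem_out = enc(torch.randn(2, 16, 512))
print(seq_out.shape, mem_out.shape)              # (2, 16, 512) (2, 10, 512)
```

Because the memory tokens participate in the same self-attention as ordinary tokens, the only architectural change is the longer input; the memory slice of the output can then serve as a compact global representation of the sequence.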
Related papers
- Enhancing Length Extrapolation in Sequential Models with Pointer-Augmented Neural Memory [66.88278207591294]
We propose Pointer-Augmented Neural Memory (PANM) to help neural networks understand and apply symbol processing to new, longer sequences of data.
PANM integrates an external neural memory that uses novel physical addresses and pointer manipulation techniques to mimic human and computer symbol processing abilities.
arXiv Detail & Related papers (2024-04-18T03:03:46Z)
- Cached Transformers: Improving Transformers with Differentiable Memory Cache [71.28188777209034]
This work introduces a new Transformer model called Cached Transformer.
It uses Gated Recurrent Cached (GRC) attention to extend the self-attention mechanism with a differentiable memory cache of tokens.
arXiv Detail & Related papers (2023-12-20T03:30:51Z)
- Recurrent Memory Transformer [0.3529736140137003]
We study a memory-augmented segment-level recurrent Transformer (Recurrent Memory Transformer).
We implement a memory mechanism with no changes to the Transformer model by adding special memory tokens to the input or output sequence.
Our model performs on par with Transformer-XL on language modeling for smaller memory sizes and outperforms it for tasks that require longer sequence processing (a minimal sketch of this segment-level recurrence appears after this list).
arXiv Detail & Related papers (2022-07-14T13:00:22Z)
- LaMemo: Language Modeling with Look-Ahead Memory [50.6248714811912]
We propose Look-Ahead Memory (LaMemo) that enhances the recurrence memory by incrementally attending to the right-side tokens.
LaMemo embraces bi-directional attention and segment recurrence with an additional overhead only linearly proportional to the memory length.
Experiments on widely used language modeling benchmarks demonstrate its superiority over the baselines equipped with different types of memory.
arXiv Detail & Related papers (2022-04-15T06:11:25Z)
- Mention Memory: incorporating textual knowledge into Transformers through entity mention attention [21.361822569279003]
We propose to integrate a semi-parametric representation of a large text corpus into a Transformer model as a source of factual knowledge.
The proposed model - TOME - is a Transformer that accesses the information through internal memory layers in which each entity mention in the input passage attends to the mention memory.
In experiments using a memory of 150 million Wikipedia mentions, TOME achieves strong performance on several open-domain knowledge-intensive tasks.
arXiv Detail & Related papers (2021-10-12T17:19:05Z)
- Language Modeling using LMUs: 10x Better Data Efficiency or Improved Scaling Compared to Transformers [4.899818550820576]
We construct a Legendre Memory Unit based model that introduces a general prior for sequence processing.
We show that our new architecture attains the same accuracy as transformers with 10x fewer tokens.
arXiv Detail & Related papers (2021-10-05T23:20:37Z)
- GroupBERT: Enhanced Transformer Architecture with Efficient Grouped Structures [57.46093180685175]
We demonstrate a set of modifications to the structure of a Transformer layer, producing a more efficient architecture.
We add a convolutional module to complement the self-attention module, decoupling the learning of local and global interactions.
We apply the resulting architecture to language representation learning and demonstrate its superior performance compared to BERT models of different scales.
arXiv Detail & Related papers (2021-06-10T15:41:53Z)
- Memformer: A Memory-Augmented Transformer for Sequence Modeling [55.780849185884996]
We present Memformer, an efficient neural network for sequence modeling.
Our model achieves linear time complexity and constant memory space complexity when processing long sequences.
arXiv Detail & Related papers (2020-10-14T09:03:36Z)
- Addressing Some Limitations of Transformers with Feedback Memory [51.94640029417114]
Transformers have been successfully applied to sequential, auto-regressive tasks despite being feedforward networks.
We propose the Feedback Transformer architecture that exposes all previous representations to all future representations.
We demonstrate on a variety of benchmarks in language modeling, machine translation, and reinforcement learning that the increased representation capacity can create small, shallow models with much stronger performance than comparable Transformers.
arXiv Detail & Related papers (2020-02-21T16:37:57Z)
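
The Recurrent Memory Transformer entry above describes carrying memory tokens across segments of a long input. A minimal sketch of that segment-level recurrence, again assuming PyTorch and using hypothetical names (RecurrentMemoryBlock), might look as follows; it illustrates the idea rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class RecurrentMemoryBlock(nn.Module):
    """Illustrative segment-level recurrent Transformer: memory tokens are
    concatenated with each segment and the updated memory is handed to the
    next segment (not the authors' implementation)."""

    def __init__(self, d_model=512, nhead=8, num_layers=4, num_mem_tokens=10):
        super().__init__()
        self.init_mem = nn.Parameter(torch.randn(num_mem_tokens, d_model) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, segment, mem=None):         # segment: (batch, seg_len, d_model)
        if mem is None:                            # first segment starts from a trained init
            mem = self.init_mem.unsqueeze(0).expand(segment.size(0), -1, -1)
        n_mem = mem.size(1)
        h = self.encoder(torch.cat([mem, segment], dim=1))
        new_mem, seg_out = h[:, :n_mem], h[:, n_mem:]
        return seg_out, new_mem                    # new_mem is recycled for the next segment

# Process a long sequence as a chain of segments, threading memory through.
block, mem = RecurrentMemoryBlock(), None
long_seq = torch.randn(2, 64, 512)                 # four segments of length 16
for segment in long_seq.split(16, dim=1):
    seg_out, mem = block(segment, mem)
```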
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.