GMAT: Global Memory Augmentation for Transformers
- URL: http://arxiv.org/abs/2006.03274v1
- Date: Fri, 5 Jun 2020 07:50:40 GMT
- Title: GMAT: Global Memory Augmentation for Transformers
- Authors: Ankit Gupta, Jonathan Berant
- Abstract summary: We propose to augment sparse Transformer blocks with a dense attention-based $\textit{global memory}$ of length $M$ ($\ll L$).
Our augmentation has a manageable $O(M\cdot(L+M))$ memory overhead, and can be seamlessly integrated with prior sparse solutions.
- Score: 45.584411593847406
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based models have become ubiquitous in natural language
processing thanks to their large capacity, innate parallelism and high
performance. The contextualizing component of a Transformer block is the
$\textit{pairwise dot-product}$ attention that has a large $\Omega(L^2)$ memory
requirement for length $L$ sequences, limiting its ability to process long
documents. This has been the subject of substantial interest recently, where
multiple approximations were proposed to reduce the quadratic memory
requirement using sparse attention matrices. In this work, we propose to
augment sparse Transformer blocks with a dense attention-based $\textit{global
memory}$ of length $M$ ($\ll L$) which provides an aggregate global view of the
entire input sequence to each position. Our augmentation has a manageable
$O(M\cdot(L+M))$ memory overhead, and can be seamlessly integrated with prior
sparse solutions. Moreover, global memory can also be used for sequence
compression, by representing a long input sequence with the memory
representations only. We empirically show that our method leads to substantial
improvement on a range of tasks, including (a) synthetic tasks that require
global reasoning, (b) masked language modeling, and (c) reading comprehension.
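For intuition, here is a minimal, self-contained PyTorch sketch of the kind of augmentation the abstract describes: a short global memory that attends densely to the whole input, while ordinary positions combine a sparse local window with reads from that memory. The function names (`window_attention`, `global_memory_attention`), the simple sliding-window pattern, and the summation of the two attention outputs are illustrative assumptions, not the paper's exact formulation; the sketch also omits query/key/value projections, multiple heads, and the feed-forward sublayer of a real Transformer block.

```python
import torch
import torch.nn.functional as F


def window_attention(q, k, v, window: int):
    """Local sparse attention: each position attends only to a fixed window
    around itself, so the score matrix costs O(L * window) instead of O(L^2).
    q, k, v: (batch, length, dim)."""
    B, L, D = q.shape
    out = torch.zeros_like(q)
    for i in range(L):  # plain loop for clarity; real kernels are vectorized
        lo, hi = max(0, i - window), min(L, i + window + 1)
        scores = q[:, i:i + 1] @ k[:, lo:hi].transpose(1, 2) / D ** 0.5
        out[:, i:i + 1] = F.softmax(scores, dim=-1) @ v[:, lo:hi]
    return out


def global_memory_attention(x, mem, window: int = 64):
    """x: (batch, L, dim) sequence states; mem: (batch, M, dim) global memory.
    The memory attends densely to all L + M positions, giving O(M * (L + M))
    scores; sequence positions use the sparse window plus dense reads of the
    M memory slots, adding O(L * M) scores."""
    D = x.shape[-1]
    full = torch.cat([mem, x], dim=1)                    # (batch, M + L, dim)

    # Memory update: dense attention over memory + sequence.
    mem_scores = mem @ full.transpose(1, 2) / D ** 0.5   # (batch, M, M + L)
    new_mem = F.softmax(mem_scores, dim=-1) @ full

    # Sequence update: local window attention plus reads from global memory.
    local = window_attention(x, x, x, window)
    seq_scores = x @ mem.transpose(1, 2) / D ** 0.5      # (batch, L, M)
    from_mem = F.softmax(seq_scores, dim=-1) @ mem
    return local + from_mem, new_mem


# Example: a 1024-token sequence summarized through 16 memory slots.
x, mem = torch.randn(2, 1024, 64), torch.randn(2, 16, 64)
seq_out, mem_out = global_memory_attention(x, mem)
```

The memory term is where the stated $O(M\cdot(L+M))$ overhead comes from: the memory's score matrix has $M\times(M+L)$ entries, and the sequence-to-memory scores add another $L\times M$, both negligible next to $L^2$ when $M \ll L$.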
Related papers
- Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs [61.40047491337793]
We present Hierarchical cOntext MERging (HOMER), a new training-free scheme designed to overcome the context-length limitations of large language models.
HOMER uses a divide-and-conquer algorithm, dividing long inputs into manageable chunks.
A token reduction technique precedes each merging, ensuring memory usage efficiency.
arXiv Detail & Related papers (2024-04-16T06:34:08Z) - Blockwise Parallel Transformer for Large Context Models [70.97386897478238]
Blockwise Parallel Transformer (BPT) is a blockwise computation of self-attention and feedforward network fusion to minimize memory costs.
By processing longer input sequences while maintaining memory efficiency, BPT enables training sequences 32 times longer than vanilla Transformers and up to 4 times longer than previous memory-efficient methods.
arXiv Detail & Related papers (2023-05-30T19:25:51Z) - Recurrent Memory Transformer [0.3529736140137003]
We study a memory-augmented segment-level recurrent Transformer (Recurrent Memory Transformer).
We implement a memory mechanism with no changes to the Transformer model by adding special memory tokens to the input or output sequence (a sketch of this memory-token idea appears after this list).
Our model performs on par with the Transformer-XL on language modeling for smaller memory sizes and outperforms it for tasks that require longer sequence processing.
arXiv Detail & Related papers (2022-07-14T13:00:22Z) - ABC: Attention with Bounded-memory Control [67.40631793251997]
We show that several efficient attention variants can be subsumed into one abstraction, attention with bounded-memory control (ABC).
ABC reveals new, unexplored possibilities. First, it connects several efficient attention variants that would otherwise seem apart.
Last, we present a new instance of ABC, which draws inspiration from existing ABC approaches, but replaces their memory-organizing functions with a learned, contextualized one.
arXiv Detail & Related papers (2021-10-06T03:53:25Z) - Combiner: Full Attention Transformer with Sparse Computation Cost [142.10203598824964]
We propose Combiner, which provides full attention capability in each attention head while maintaining low computation complexity.
We show that most sparse attention patterns used in existing sparse Transformers can inspire the design of such a factorization for full attention.
An experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach.
arXiv Detail & Related papers (2021-07-12T22:43:11Z) - Sub-Linear Memory: How to Make Performers SLiM [38.068090269482425]
Vanilla Transformers require $O(L^2)$ serial time and memory as functions of input length $L$.
Recent works proposed various linear self-attention mechanisms, scaling only as $O(L)$ for serial computation.
We observe a remarkable computational flexibility: forward and backward propagation can be performed with no approximations using sublinear memory.
arXiv Detail & Related papers (2020-12-21T13:56:04Z) - Memory Transformer [0.31406146587437894]
Transformer-based models have achieved state-of-the-art results in many natural language processing tasks.
Memory-augmented neural networks (MANNs) extend traditional neural architectures with general-purpose memory for representations.
We evaluate these memory augmented Transformers and demonstrate that presence of memory positively correlates with the model performance.
arXiv Detail & Related papers (2020-06-20T09:06:27Z) - $O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers [71.31712741938837]
We show that sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n^2$ connections.
We also present experiments comparing different patterns/levels of sparsity on standard NLP tasks.
arXiv Detail & Related papers (2020-06-08T18:30:12Z)
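As a companion to the memory-token idea above (shared by GMAT's global memory and the Recurrent Memory Transformer entry), the sketch below shows how memory slots can be carried through an otherwise unmodified Transformer encoder simply by concatenating them to each segment's input. The module sizes and the `process_segment` helper are illustrative assumptions, not code from either paper.

```python
import torch
import torch.nn as nn

D_MODEL, N_MEM, SEG_LEN = 256, 8, 128
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=8, batch_first=True),
    num_layers=2,
)

def process_segment(segment, memory):
    """Prepend memory tokens to the segment; the unmodified encoder mixes them
    with ordinary tokens, and the updated memory carries over to the next segment."""
    x = torch.cat([memory, segment], dim=1)          # (batch, N_MEM + SEG_LEN, d)
    y = encoder(x)
    return y[:, N_MEM:], y[:, :N_MEM]                # segment output, new memory

memory = torch.zeros(1, N_MEM, D_MODEL)              # initial memory slots
for segment in torch.randn(3, 1, SEG_LEN, D_MODEL):  # three consecutive segments
    out, memory = process_segment(segment, memory)
```

Because the memory slots are ordinary positions from the encoder's point of view, no architectural change is needed, which is what the "no changes to the Transformer model" claim above refers to.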
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.