ABC: Attention with Bounded-memory Control
- URL: http://arxiv.org/abs/2110.02488v1
- Date: Wed, 6 Oct 2021 03:53:25 GMT
- Title: ABC: Attention with Bounded-memory Control
- Authors: Hao Peng, Jungo Kasai, Nikolaos Pappas, Dani Yogatama, Zhaofeng Wu,
Lingpeng Kong, Roy Schwartz, Noah A. Smith
- Abstract summary: We show that disparate efficient attention approaches can be subsumed into one abstraction, attention with bounded-memory control (ABC).
ABC reveals new, unexplored possibilities. First, it connects several efficient attention variants that would otherwise seem apart.
Last, we present a new instance of ABC, which draws inspiration from existing ABC approaches but replaces their heuristic memory-organizing functions with a learned, contextualized one.
- Score: 67.40631793251997
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer architectures have achieved state-of-the-art results on a variety
of sequence modeling tasks. However, their attention mechanism comes with a
quadratic complexity in sequence lengths, making the computational overhead
prohibitive, especially for long sequences. Attention context can be seen as a
random-access memory with each token taking a slot. Under this perspective, the
memory size grows linearly with the sequence length, and so does the overhead
of reading from it. One way to improve the efficiency is to bound the memory
size. We show that disparate approaches can be subsumed into one abstraction,
attention with bounded-memory control (ABC), and they vary in their
organization of the memory. ABC reveals new, unexplored possibilities. First,
it connects several efficient attention variants that would otherwise seem
apart. Second, this abstraction gives new insights--an established approach
(Wang et al., 2020b) previously thought to be not applicable in causal
attention, actually is. Last, we present a new instance of ABC, which draws
inspiration from existing ABC approaches, but replaces their heuristic
memory-organizing functions with a learned, contextualized one. Our experiments
on language modeling, machine translation, and masked language model finetuning
show that our approach outperforms previous efficient attention models;
compared to the strong transformer baselines, it significantly improves the
inference time and space efficiency with no or negligible accuracy loss.
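A minimal NumPy sketch of the non-causal form of this abstraction is given below. It illustrates the general idea rather than the authors' implementation: `phi` stands in for whatever memory-organizing function a particular ABC instance uses, whether heuristic or, as in the new instance above, learned and contextualized.

```python
# A minimal sketch (not the authors' code) of attention with bounded-memory
# control: the context is compressed into n memory slots by a
# memory-organizing function phi, and attention reads from those slots.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def abc_attention(Q, K, V, phi):
    """Q, K, V: (T, d); phi: (T, n) slot weights per token, with n fixed.

    The memory holds n slots regardless of the sequence length T, so each
    query reads in O(n) instead of O(T).
    """
    K_mem = phi.T @ K                             # (n, d): weighted sums of keys
    V_mem = phi.T @ V                             # (n, d): same organization for values
    scores = Q @ K_mem.T / np.sqrt(Q.shape[-1])   # (T, n)
    return softmax(scores) @ V_mem                # (T, d)

# Example: 128 tokens compressed into 8 slots with a random (hypothetical) phi;
# real ABC instances use heuristic or learned control functions instead.
T, d, n = 128, 64, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))
phi = softmax(rng.standard_normal((T, n)))
out = abc_attention(Q, K, V, phi)
print(out.shape)  # (128, 64)
```

In the causal setting, the same memory can be maintained recurrently, adding each new key and value into the slots weighted by that token's control vector, so per-step time and space stay constant in the sequence length.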
Related papers
- Building, Reusing, and Generalizing Abstract Representations from Concrete Sequences [51.965994405124455]
Humans excel at learning abstract patterns across different sequences, filtering out irrelevant details.
Many sequence learning models lack the ability to abstract, which leads to memory inefficiency and poor transfer.
We introduce a non-parametric hierarchical variable learning model (HVM) that learns chunks from sequences and abstracts contextually similar chunks as variables.
arXiv Detail & Related papers (2024-10-27T18:13:07Z)
- Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention [6.713196608291278]
This work introduces an efficient method to scale Transformer-based Large Language Models to infinitely long inputs with bounded memory and computation.
A key component in our proposed approach is a new attention technique dubbed Infini-attention.
arXiv Detail & Related papers (2024-04-10T16:18:42Z)
- Simple linear attention language models balance the recall-throughput tradeoff [40.08746299497935]
We propose BASED, a simple architecture combining linear and sliding window attention.
We train language models up to 1.3b parameters and show that BASED matches the strongest sub-quadratic models in perplexity and outperforms them on real-world recall-intensive tasks by 6.22 accuracy points.
arXiv Detail & Related papers (2024-02-28T19:28:27Z)
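As a rough illustration of the two primitives the BASED entry above combines, here is a small NumPy sketch of causal linear attention and causal sliding-window attention; how the actual architecture mixes them (for example, across layers) is not modeled here, and the feature map and window size are illustrative choices.

```python
# A small sketch (not the BASED code) of the two attention primitives named
# in the entry above: causal linear attention and causal sliding-window
# attention.
import numpy as np

def linear_attention(Q, K, V, feature_map=lambda x: np.maximum(x, 0.0) + 1e-6):
    """Causal linear attention: a running (d, d_v) state summarizes the prefix."""
    Qf, Kf = feature_map(Q), feature_map(K)
    S = np.zeros((Q.shape[-1], V.shape[-1]))   # running sum of outer(k_f, v)
    z = np.zeros(Q.shape[-1])                  # running sum of k_f (normalizer)
    out = np.empty_like(V)
    for t in range(Q.shape[0]):
        S += np.outer(Kf[t], V[t])
        z += Kf[t]
        out[t] = (Qf[t] @ S) / (Qf[t] @ z)
    return out

def sliding_window_attention(Q, K, V, w=8):
    """Exact causal softmax attention restricted to the last w tokens."""
    d = Q.shape[-1]
    out = np.empty_like(V)
    for t in range(Q.shape[0]):
        lo = max(0, t - w + 1)
        s = Q[t] @ K[lo:t + 1].T / np.sqrt(d)
        p = np.exp(s - s.max())
        out[t] = (p / p.sum()) @ V[lo:t + 1]
    return out

# Example usage on random data.
T, d = 64, 32
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))
print(linear_attention(Q, K, V).shape, sliding_window_attention(Q, K, V).shape)
```

The window provides exact recall over recent tokens, while the linear-attention state carries a compressed, constant-memory summary of the full prefix; the recall-throughput tradeoff in the entry is about balancing these two.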
- Blockwise Parallel Transformer for Large Context Models [70.97386897478238]
Blockwise Parallel Transformer (BPT) leverages blockwise computation of self-attention and feedforward network fusion to minimize memory costs.
By processing longer input sequences while maintaining memory efficiency, BPT enables training sequences 32 times longer than vanilla Transformers and up to 4 times longer than previous memory-efficient methods.
arXiv Detail & Related papers (2023-05-30T19:25:51Z)
- Landmark Attention: Random-Access Infinite Context Length for Transformers [45.69864961773124]
We present a novel approach that allows access to the complete context while retaining random-access flexibility.
Our method uses a landmark token to represent each block of the input and trains the attention to use it for selecting relevant blocks.
We demonstrate that our method can obtain comparable performance with Transformer-XL while significantly reducing the number of retrieved tokens in each step.
arXiv Detail & Related papers (2023-05-25T17:53:42Z)
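A much-simplified sketch of the block-selection idea in the Landmark Attention entry above, for a single query: here each block's landmark is just the mean of its keys, whereas the method trains a dedicated landmark token, and selection is a hard top-k rather than being folded into the attention softmax.

```python
# A much-simplified, hypothetical sketch of landmark-style block selection,
# not the paper's algorithm: blocks are summarized by mean keys, and each
# query attends only over the tokens of its top-scoring blocks.
import numpy as np

def landmark_block_attention(q, K, V, block_size=16, k_blocks=2):
    """q: (d,); K, V: (T, d); returns a (d,)-shaped output for this query."""
    T, d = K.shape
    n_blocks = (T + block_size - 1) // block_size
    blocks = [np.arange(i * block_size, min((i + 1) * block_size, T))
              for i in range(n_blocks)]
    landmarks = np.stack([K[b].mean(axis=0) for b in blocks])  # (n_blocks, d)
    block_scores = landmarks @ q / np.sqrt(d)
    top = np.argsort(block_scores)[-k_blocks:]                 # selected blocks
    idx = np.concatenate([blocks[i] for i in top])             # their token ids
    s = K[idx] @ q / np.sqrt(d)
    p = np.exp(s - s.max())
    return (p / p.sum()) @ V[idx]

# Example: one query over 128 tokens, reading only 2 blocks of 16 tokens.
T, d = 128, 64
rng = np.random.default_rng(0)
K, V = rng.standard_normal((T, d)), rng.standard_normal((T, d))
q = rng.standard_normal(d)
print(landmark_block_attention(q, K, V).shape)  # (64,)
```

Per query, only k_blocks * block_size tokens are scored in full, which is where the reduction in retrieved tokens comes from.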
- LaMemo: Language Modeling with Look-Ahead Memory [50.6248714811912]
We propose Look-Ahead Memory (LaMemo) that enhances the recurrence memory by incrementally attending to the right-side tokens.
LaMemo embraces bi-directional attention and segment recurrence with an additional overhead only linearly proportional to the memory length.
Experiments on widely used language modeling benchmarks demonstrate its superiority over the baselines equipped with different types of memory.
arXiv Detail & Related papers (2022-04-15T06:11:25Z)
- Solving Continuous Control with Episodic Memory [1.9493449206135294]
Episodic memory lets reinforcement learning algorithms remember and exploit promising experience from the past to improve agent performance.
Our study aims to answer the question: can episodic memory be used to improve an agent's performance in continuous control?
arXiv Detail & Related papers (2021-06-16T14:51:39Z)
- Memformer: A Memory-Augmented Transformer for Sequence Modeling [55.780849185884996]
We present Memformer, an efficient neural network for sequence modeling.
Our model achieves linear time complexity and constant memory space complexity when processing long sequences.
arXiv Detail & Related papers (2020-10-14T09:03:36Z)
- SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection [51.376723069962]
We present a method for accelerating and structuring self-attention: Sparse Adaptive Connection (SAC).
In SAC, the input sequence is treated as a graph, and attention operations are performed between linked nodes.
We show that SAC is competitive with state-of-the-art models while significantly reducing memory cost.
arXiv Detail & Related papers (2020-03-22T07:58:44Z)
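A minimal sketch of graph-masked attention in the spirit of the SAC entry above: tokens are nodes, and each token attends only to nodes it is linked to. SAC learns which edges to use; here the adjacency matrix is simply given.

```python
# A minimal sketch of graph-masked attention: attention weights are computed
# only between linked token pairs, with the adjacency supplied rather than
# learned as in SAC.
import numpy as np

def graph_masked_attention(Q, K, V, adj):
    """Q, K, V: (T, d); adj: (T, T) boolean adjacency (True = edge)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # (T, T)
    scores = np.where(adj, scores, -np.inf)        # drop non-linked pairs
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Example: a chain graph plus self-loops, so each token sees itself and its
# immediate neighbors.
T, d = 16, 32
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))
adj = np.eye(T, dtype=bool) | np.eye(T, k=1, dtype=bool) | np.eye(T, k=-1, dtype=bool)
print(graph_masked_attention(Q, K, V, adj).shape)  # (16, 32)
```

This dense sketch still forms the full score matrix; a real implementation would compute scores only for the linked pairs, which is where the memory savings come from.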