Landmark Attention: Random-Access Infinite Context Length for
Transformers
- URL: http://arxiv.org/abs/2305.16300v2
- Date: Mon, 20 Nov 2023 01:16:17 GMT
- Title: Landmark Attention: Random-Access Infinite Context Length for
Transformers
- Authors: Amirkeivan Mohtashami, Martin Jaggi
- Abstract summary: We present a novel approach that allows access to the complete context while retaining random-access flexibility.
Our method uses a landmark token to represent each block of the input and trains the attention to use it for selecting relevant blocks.
We demonstrate that our method can obtain comparable performance with Transformer-XL while significantly reducing the number of retrieved tokens in each step.
- Score: 45.69864961773124
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While Transformers have shown remarkable success in natural language
processing, their attention mechanism's large memory requirements have limited
their ability to handle longer contexts. Prior approaches, such as recurrent
memory or retrieval-based augmentation, have either compromised the
random-access flexibility of attention (i.e., the capability to select any
token in the entire context) or relied on separate mechanisms for relevant
context retrieval, which may not be compatible with the model's attention. In
this paper, we present a novel approach that allows access to the complete
context while retaining random-access flexibility, closely resembling running
attention on the entire context. Our method uses a landmark token to represent
each block of the input and trains the attention to use it for selecting
relevant blocks, enabling retrieval of blocks directly through the attention
mechanism instead of by relying on a separate mechanism. Our approach
seamlessly integrates with specialized data structures and the system's memory
hierarchy, enabling processing of arbitrarily long context lengths. We
demonstrate that our method can obtain comparable performance with
Transformer-XL while significantly reducing the number of retrieved tokens in
each step. Finally, we show that fine-tuning LLaMA 7B with our method
successfully extends its context length capacity to over 32k tokens, allowing
for inference at the context lengths of GPT-4. We release the implementation of
landmark attention and the code to reproduce our experiments at
https://github.com/epfml/landmark-attention/.
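For intuition, below is a minimal, self-contained sketch of the block-retrieval idea in plain Python/NumPy. It is not the released implementation (see the repository above): the function name landmark_retrieval_attention, the mean-pooled block summaries standing in for trained landmark tokens, and the block_size/top_k parameters are illustrative assumptions only.

# Illustrative sketch only -- not the released implementation
# (https://github.com/epfml/landmark-attention/).
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def landmark_retrieval_attention(query, keys, values, block_size=8, top_k=2):
    # query: (d,); keys/values: (n, d) covering the full long context.
    n, d = keys.shape
    n_blocks = int(np.ceil(n / block_size))

    # Stand-in for trained landmark tokens: summarize each block by the mean
    # of its keys (the paper instead trains a dedicated landmark token whose
    # attention score gates its block).
    landmarks = np.stack([
        keys[b * block_size:(b + 1) * block_size].mean(axis=0)
        for b in range(n_blocks)
    ])

    # Use attention scores against the landmarks to pick the relevant blocks.
    block_scores = landmarks @ query / np.sqrt(d)
    chosen = np.argsort(block_scores)[-top_k:]

    # Attend only to the tokens of the retrieved blocks.
    idx = np.concatenate([
        np.arange(b * block_size, min((b + 1) * block_size, n)) for b in chosen
    ])
    weights = softmax(keys[idx] @ query / np.sqrt(d))
    return weights @ values[idx]

# Example: answer a query over a 64-token context while touching only 2 blocks.
rng = np.random.default_rng(0)
keys, values = rng.normal(size=(64, 32)), rng.normal(size=(64, 32))
query = rng.normal(size=32)
print(landmark_retrieval_attention(query, keys, values).shape)  # (32,)

Because the blocks are selected by attention scores themselves rather than by a separate retriever, retrieved blocks can in principle be paged in from slower memory or disk, which is what the abstract means by integrating with the system's memory hierarchy.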
Related papers
- Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention [6.713196608291278]
This work introduces an efficient method to scale Transformer-based Large Language Models to infinitely long inputs with bounded memory and computation.
A key component in our proposed approach is a new attention technique dubbed Infini-attention.
arXiv Detail & Related papers (2024-04-10T16:18:42Z)
- Fovea Transformer: Efficient Long-Context Modeling with Structured Fine-to-Coarse Attention [17.48544285026157]
We introduce Fovea Transformer, a long-context focused transformer.
We use representations of context tokens with a progressively coarser granularity in the tree, as their distance to the query token increases.
We evaluate our model on three long-context summarization tasks.
arXiv Detail & Related papers (2023-11-13T06:24:27Z)
- Walking Down the Memory Maze: Beyond Context Limit through Interactive Reading [63.93888816206071]
We introduce MemWalker, a method that processes the long context into a tree of summary nodes. Upon receiving a query, the model navigates this tree in search of relevant information, and responds once it gathers sufficient information.
We show that, beyond effective reading, MemWalker enhances explainability by highlighting the reasoning steps as it interactively reads the text and pinpointing the text segments relevant to the query.
arXiv Detail & Related papers (2023-10-08T06:18:14Z)
- Ring Attention with Blockwise Transformers for Near-Infinite Context [88.61687950039662]
We present a novel approach, Ring Attention with Blockwise Transformers (Ring Attention), which leverages blockwise computation of self-attention and feedforward to distribute long sequences across multiple devices.
Our approach enables training and inference on sequences up to device-count times longer than those achievable by prior memory-efficient Transformers.
arXiv Detail & Related papers (2023-10-03T08:44:50Z)
- Constant Memory Attention Block [74.38724530521277]
Constant Memory Attention Block (CMAB) is a novel general-purpose attention block that computes its output in constant memory and performs updates in constant computation.
We show that our proposed methods achieve results competitive with the state of the art while being significantly more memory-efficient.
arXiv Detail & Related papers (2023-06-21T22:41:58Z)
- ABC: Attention with Bounded-memory Control [67.40631793251997]
We show that a wide class of efficient attention variants can be subsumed into one abstraction, attention with bounded-memory control (ABC).
ABC reveals new, unexplored possibilities. First, it connects several efficient attention variants that would otherwise seem apart.
Finally, we present a new instance of ABC that draws inspiration from existing ABC approaches but replaces their memory-organizing functions with a learned, contextualized one.
arXiv Detail & Related papers (2021-10-06T03:53:25Z)
- Learning Hard Retrieval Decoder Attention for Transformers [69.40942736249397]
The Transformer translation model is based on the multi-head attention mechanism, which can be parallelized easily.
We show that our hard retrieval attention mechanism is 1.43 times faster in decoding.
arXiv Detail & Related papers (2020-09-30T13:18:57Z)