Landmark Attention: Random-Access Infinite Context Length for
Transformers
- URL: http://arxiv.org/abs/2305.16300v2
- Date: Mon, 20 Nov 2023 01:16:17 GMT
- Title: Landmark Attention: Random-Access Infinite Context Length for
Transformers
- Authors: Amirkeivan Mohtashami, Martin Jaggi
- Abstract summary: We present a novel approach that allows access to the complete context while retaining random-access flexibility.
Our method uses a landmark token to represent each block of the input and trains the attention to use it for selecting relevant blocks.
We demonstrate that our method can obtain comparable performance with Transformer-XL while significantly reducing the number of retrieved tokens in each step.
- Score: 45.69864961773124
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While Transformers have shown remarkable success in natural language
processing, their attention mechanism's large memory requirements have limited
their ability to handle longer contexts. Prior approaches, such as recurrent
memory or retrieval-based augmentation, have either compromised the
random-access flexibility of attention (i.e., the capability to select any
token in the entire context) or relied on separate mechanisms for relevant
context retrieval, which may not be compatible with the model's attention. In
this paper, we present a novel approach that allows access to the complete
context while retaining random-access flexibility, closely resembling running
attention on the entire context. Our method uses a landmark token to represent
each block of the input and trains the attention to use it for selecting
relevant blocks, enabling retrieval of blocks directly through the attention
mechanism instead of by relying on a separate mechanism. Our approach
seamlessly integrates with specialized data structures and the system's memory
hierarchy, enabling processing of arbitrarily long context lengths. We
demonstrate that our method can obtain comparable performance with
Transformer-XL while significantly reducing the number of retrieved tokens in
each step. Finally, we show that fine-tuning LLaMA 7B with our method
successfully extends its context length capacity to over 32k tokens, allowing
for inference at the context lengths of GPT-4. We release the implementation of
landmark attention and the code to reproduce our experiments at
https://github.com/epfml/landmark-attention/.
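For intuition, below is a minimal, self-contained sketch of the block-retrieval idea in plain Python/NumPy. It is not the released implementation (see the repository above): the function name landmark_retrieval_attention, the mean-pooled block summaries standing in for trained landmark tokens, and the block_size/top_k parameters are illustrative assumptions only.

# Illustrative sketch only -- not the released implementation
# (https://github.com/epfml/landmark-attention/).
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def landmark_retrieval_attention(query, keys, values, block_size=8, top_k=2):
    # query: (d,); keys/values: (n, d) covering the full long context.
    n, d = keys.shape
    n_blocks = int(np.ceil(n / block_size))

    # Stand-in for trained landmark tokens: summarize each block by the mean
    # of its keys (the paper instead trains a dedicated landmark token whose
    # attention score gates its block).
    landmarks = np.stack([
        keys[b * block_size:(b + 1) * block_size].mean(axis=0)
        for b in range(n_blocks)
    ])

    # Use attention scores against the landmarks to pick the relevant blocks.
    block_scores = landmarks @ query / np.sqrt(d)
    chosen = np.argsort(block_scores)[-top_k:]

    # Attend only to the tokens of the retrieved blocks.
    idx = np.concatenate([
        np.arange(b * block_size, min((b + 1) * block_size, n)) for b in chosen
    ])
    weights = softmax(keys[idx] @ query / np.sqrt(d))
    return weights @ values[idx]

# Example: answer a query over a 64-token context while touching only 2 blocks.
rng = np.random.default_rng(0)
keys, values = rng.normal(size=(64, 32)), rng.normal(size=(64, 32))
query = rng.normal(size=32)
print(landmark_retrieval_attention(query, keys, values).shape)  # (32,)

Because the blocks are selected by attention scores themselves rather than by a separate retriever, retrieved blocks can in principle be paged in from slower memory or disk, which is what the abstract means by integrating with the system's memory hierarchy.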
Related papers
- Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention [6.713196608291278]
This work introduces an efficient method to scale Transformer-based Large Language Models to infinitely long inputs with bounded memory and computation.
A key component in our proposed approach is a new attention technique dubbed Infini-attention.
arXiv Detail & Related papers (2024-04-10T16:18:42Z)
- Fovea Transformer: Efficient Long-Context Modeling with Structured Fine-to-Coarse Attention [17.48544285026157]
We introduce Fovea Transformer, a long-context focused transformer.
We use representations of context tokens with a progressively coarser granularity in the tree, as their distance to the query token increases.
We evaluate our model on three long-context summarization tasks.
arXiv Detail & Related papers (2023-11-13T06:24:27Z)
- Walking Down the Memory Maze: Beyond Context Limit through Interactive Reading [63.93888816206071]
We introduce MemWalker, a method that processes the long context into a tree of summary nodes. Upon receiving a query, the model navigates this tree in search of relevant information, and responds once it gathers sufficient information.
We show that, beyond effective reading, MemWalker enhances explainability by highlighting the reasoning steps as it interactively reads the text and pinpointing the text segments relevant to the query.
arXiv Detail & Related papers (2023-10-08T06:18:14Z)
- Ring Attention with Blockwise Transformers for Near-Infinite Context [88.61687950039662]
We present a novel approach, Ring Attention with Blockwise Transformers (Ring Attention), which leverages blockwise computation of self-attention and feedforward to distribute long sequences across multiple devices.
Our approach enables training and inference on sequences up to device-count times longer than those achievable by prior memory-efficient Transformers.
arXiv Detail & Related papers (2023-10-03T08:44:50Z)
- Constant Memory Attention Block [74.38724530521277]
Constant Memory Attention Block (CMAB) is a novel general-purpose attention block that computes its output in constant memory and performs updates in constant computation.
We show that our proposed methods achieve results competitive with the state of the art while being significantly more memory-efficient.
arXiv Detail & Related papers (2023-06-21T22:41:58Z)
- ABC: Attention with Bounded-memory Control [67.40631793251997]
We show that a wide class of efficient attention variants can be subsumed into one abstraction, attention with bounded-memory control (ABC).
ABC reveals new, unexplored possibilities. First, it connects several efficient attention variants that would otherwise seem apart.
Finally, we present a new instance of ABC that draws inspiration from existing ABC approaches but replaces their memory-organizing functions with a learned, contextualized one.
arXiv Detail & Related papers (2021-10-06T03:53:25Z)
- Learning Hard Retrieval Decoder Attention for Transformers [69.40942736249397]
The Transformer translation model is based on the multi-head attention mechanism, which can be parallelized easily.
We show that our hard retrieval attention mechanism is 1.43 times faster in decoding.
arXiv Detail & Related papers (2020-09-30T13:18:57Z)