Attendre: Wait To Attend By Retrieval With Evicted Queries in
Memory-Based Transformers for Long Context Processing
- URL: http://arxiv.org/abs/2401.04881v1
- Date: Wed, 10 Jan 2024 02:20:48 GMT
- Title: Attendre: Wait To Attend By Retrieval With Evicted Queries in
Memory-Based Transformers for Long Context Processing
- Authors: Zi Yang, Nan Hua
- Abstract summary: One effective approach is to use a FIFO memory to store keys and values of an attention sublayer from past chunks to allow subsequent queries to attend.
We propose to use eviction policies, such as LRA and LFA, to reduce the memory size and adapt to various architectures.
We also propose the Attendre layer, a wait-to-attend mechanism that retrieves the key-value memory with queries evicted from the query memory.
- Score: 2.9733429388858714
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As LLMs have become capable of processing more complex types of inputs,
researchers have recently studied how to efficiently and affordably process
possibly arbitrarily long sequences. One effective approach is to use a FIFO
memory to store keys and values of an attention sublayer from past chunks to
allow subsequent queries to attend to them. However, this approach requires a large
memory and/or must take the specific LM architecture into consideration.
Moreover, due to the causal ordering between the key-values in prior context and
the queries at present, this approach cannot be extended to bidirectional
attention, such as in an encoder-decoder or a PrefixLM decoder-only architecture.
In this paper, we propose to use eviction policies, such as LRA and LFA, to
reduce the memory size and adapt to various architectures, and we also propose
the Attendre layer, a wait-to-attend mechanism that retrieves the key-value
memory (K/V memory) with queries evicted from the query memory (Q memory). As a
first step, we evaluate this method in the context length extension setup using
the TriviaQA reading comprehension task, and show the effectiveness of the
approach.
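The abstract names three interacting pieces: a bounded K/V memory fed by past chunks, an eviction policy over that memory, and a Q memory whose evicted queries finally attend. The following is a rough, assumption-heavy NumPy sketch of how such pieces might fit together; the class names, the recency heuristic standing in for an LRA-style policy, and all capacities are illustrative and not the paper's Attendre implementation.
```python
# Illustrative sketch only (not the paper's code): a bounded key/value memory with a
# "least recently attended"-style eviction score, plus a query memory that lets
# evicted queries wait and attend to keys that arrive later.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class KVMemory:
    def __init__(self, capacity, dim):
        self.capacity, self.dim = capacity, dim
        self.keys = np.zeros((0, dim))
        self.values = np.zeros((0, dim))
        self.recency = np.zeros(0)        # step at which each key was last attended
        self.step = 0

    def append(self, k, v):
        self.keys = np.concatenate([self.keys, k])
        self.values = np.concatenate([self.values, v])
        self.recency = np.concatenate([self.recency, np.full(len(k), float(self.step))])
        # Evict the lowest-recency entries once over capacity (an LRA-like policy;
        # a plain FIFO would instead drop the oldest positions).
        if len(self.keys) > self.capacity:
            keep = np.sort(np.argsort(self.recency)[-self.capacity:])
            self.keys, self.values = self.keys[keep], self.values[keep]
            self.recency = self.recency[keep]

    def attend(self, q):
        self.step += 1
        scores = softmax(q @ self.keys.T / np.sqrt(self.dim))      # (n_q, n_keys)
        # Mark keys that received above-uniform attention from some query as recently attended.
        attended = scores.max(axis=0) > 1.0 / len(self.keys)
        self.recency = np.where(attended, float(self.step), self.recency)
        return scores @ self.values

class QueryMemory:
    """Parks query chunks so they can wait to attend until later keys arrive."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.queries = []

    def push(self, q):
        """Store the new query chunk; return any chunks evicted by the FIFO bound."""
        self.queries.append(q)
        evicted, self.queries = self.queries[:-self.capacity], self.queries[-self.capacity:]
        return evicted

# Toy usage: stream chunks; queries attend only once they are evicted from the Q memory.
dim = 8
kv_mem, q_mem = KVMemory(capacity=32, dim=dim), QueryMemory(capacity=4)
rng = np.random.default_rng(0)
for _ in range(10):
    q, k, v = (rng.standard_normal((4, dim)) for _ in range(3))
    kv_mem.append(k, v)                 # K/V of the current chunk enter the memory
    for old_q in q_mem.push(q):         # queries evicted from the Q memory...
        out = kv_mem.attend(old_q)      # ...only now attend over the larger K/V memory
```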
Related papers
- Enhancing Long Context Performance in LLMs Through Inner Loop Query Mechanism [2.919891871101241]
The computational complexity of Transformers scales quadratically with input size.
Retrieval-augmented generation (RAG) can better handle longer contexts by using a retrieval system.
We introduce a novel approach, Inner Loop Memory Augmented Tree Retrieval (ILM-TR).
arXiv Detail & Related papers (2024-10-11T19:49:05Z)
- Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling [53.58854856174773]
Speculative decoding is an approach to accelerate inference through a guess-and-verify paradigm.
Token Recycling stores candidate tokens in an adjacency matrix and employs a breadth-first search algorithm.
It significantly outperforms existing training-free methods by 30% and even a training-based method by 25% (see the sketch after this list).
arXiv Detail & Related papers (2024-08-16T12:20:56Z)
- Training-Free Exponential Context Extension via Cascading KV Cache [49.608367376911694]
We introduce a novel mechanism that leverages cascading sub-cache buffers to selectively retain the most relevant tokens.
Our method reduces prefill stage latency by a factor of 6.8 when compared to flash attention on 1M tokens.
arXiv Detail & Related papers (2024-06-24T03:59:17Z)
- MeMSVD: Long-Range Temporal Structure Capturing Using Incremental SVD [27.472705540825316]
This paper is on long-term video understanding, where the goal is to recognise human actions over long temporal windows (up to minutes long).
We propose an alternative to attention-based schemes, based on a low-rank approximation of the memory obtained using Singular Value Decomposition (see the sketch after this list).
Our scheme has two advantages: (a) it reduces complexity by more than an order of magnitude, and (b) it is amenable to an efficient implementation for the calculation of the memory bases.
arXiv Detail & Related papers (2024-06-11T12:03:57Z)
- When to Retrieve: Teaching LLMs to Utilize Information Retrieval Effectively [3.705145020383824]
We show how Large Language Models (LLMs) can learn to use an off-the-shelf information retrieval (IR) system specifically when additional context is required to answer a given question.
arXiv Detail & Related papers (2024-04-30T16:52:55Z)
- CORM: Cache Optimization with Recent Message for Large Language Model Inference [57.109354287786154]
We introduce an innovative method for optimizing the KV cache, which considerably reduces its memory footprint.
CORM, a KV cache eviction policy, dynamically retains essential key-value pairs for inference without the need for model fine-tuning.
Our validation shows that CORM reduces the inference memory usage of the KV cache by up to 70% with negligible performance degradation across six tasks in LongBench (see the sketch after this list).
arXiv Detail & Related papers (2024-04-24T16:11:54Z)
- Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs [61.40047491337793]
We present Hierarchical cOntext MERging (HOMER), a new training-free scheme designed to overcome the context-length limitations of large language models.
HOMER uses a divide-and-conquer algorithm, dividing long inputs into manageable chunks.
A token reduction technique precedes each merging, ensuring memory usage efficiency.
arXiv Detail & Related papers (2024-04-16T06:34:08Z)
- Anchor-based Large Language Models [33.86392289481657]
This study introduces Anchor-based LLMs (AnLLMs), which utilize an anchor-based self-attention network (AnSAN) and an anchor-based inference strategy.
AnLLMs maintain similar accuracy levels while achieving up to 99% keys/values cache reduction and up to 3.5 times faster inference.
arXiv Detail & Related papers (2024-02-12T12:48:02Z)
- Walking Down the Memory Maze: Beyond Context Limit through Interactive Reading [63.93888816206071]
We introduce MemWalker, a method that processes the long context into a tree of summary nodes. Upon receiving a query, the model navigates this tree in search of relevant information, and responds once it gathers sufficient information.
We show that, beyond effective reading, MemWalker enhances explainability by highlighting the reasoning steps as it interactively reads the text, pinpointing the text segments relevant to the query (see the sketch after this list).
arXiv Detail & Related papers (2023-10-08T06:18:14Z)
- Towards Model-Size Agnostic, Compute-Free, Memorization-based Inference of Deep Learning [5.41530201129053]
This paper proposes a novel memorization-based inference (MBI) that is compute-free and only requires lookups.
Specifically, our work capitalizes on the inference mechanism of the recurrent attention model (RAM).
By leveraging the low dimensionality of the glimpse, our inference procedure stores key-value pairs comprising glimpse location, patch vector, etc. in a table.
The computations are obviated during inference by using the table to read out key-value pairs, performing compute-free inference by memorization.
arXiv Detail & Related papers (2023-07-14T21:01:59Z)
- Sequential Recommender via Time-aware Attentive Memory Network [67.26862011527986]
We propose a temporal gating methodology to improve the attention mechanism and recurrent units.
We also propose a Multi-hop Time-aware Attentive Memory network to integrate long-term and short-term preferences.
Our approach is scalable for candidate retrieval tasks and can be viewed as a non-linear generalization of latent factorization for dot-product based Top-K recommendation.
arXiv Detail & Related papers (2020-05-18T11:29:38Z)
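For the Token Recycling entry above, here is a minimal toy of the stated data flow: candidate successor tokens stored in an adjacency matrix, expanded breadth-first into a draft that a guess-and-verify (speculative) decoder would then check. The vocabulary size, branching factor, and function names are hypothetical, not the paper's implementation.
```python
# Hypothetical illustration of "candidate tokens in an adjacency matrix + BFS drafting".
from collections import deque
import numpy as np

VOCAB, TOP_K = 100, 2                 # toy vocabulary size and branching factor

# adjacency[t] holds TOP_K recently observed candidate successors of token t
adjacency = np.zeros((VOCAB, TOP_K), dtype=np.int64)

def recycle(token, candidate_successors):
    """Store otherwise-discarded candidate tokens as successors of `token`."""
    adjacency[token] = candidate_successors[:TOP_K]

def build_draft(root_token, max_nodes=8):
    """Breadth-first expansion of the adjacency matrix into a draft token sequence."""
    draft, queue = [], deque([root_token])
    while queue and len(draft) < max_nodes:
        tok = queue.popleft()
        for nxt in adjacency[tok]:
            draft.append(int(nxt))
            queue.append(int(nxt))
    return draft    # a target model would verify these drafted tokens in one forward pass

# Toy usage: recycle a few candidates, then draft from token 5.
recycle(5, np.array([7, 9]))
recycle(7, np.array([3, 5]))
recycle(9, np.array([1, 2]))
print(build_draft(5))    # e.g. [7, 9, 3, 5, 1, 2, ...]
```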
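For the MeMSVD entry, a toy of the underlying low-rank idea: approximate a growing feature memory with a truncated SVD and keep only the resulting factors. The incremental SVD update that gives the paper its efficiency is omitted here, and the rank and shapes are arbitrary assumptions.
```python
# Toy sketch: replace a feature memory with a rank-r approximation via truncated SVD.
import numpy as np

def compress_memory(memory, rank):
    """Return low-rank factors (left, right) whose product approximates `memory`."""
    u, s, vt = np.linalg.svd(memory, full_matrices=False)
    left = u[:, :rank] * s[:rank]      # (T, rank)
    right = vt[:rank]                  # (rank, D)
    return left, right

rng = np.random.default_rng(0)
# An approximately rank-8 "memory" of 512 past frame features (64-dim each).
memory = rng.standard_normal((512, 8)) @ rng.standard_normal((8, 64))
memory += 0.01 * rng.standard_normal((512, 64))
left, right = compress_memory(memory, rank=8)
reconstruction = left @ right
rel_err = np.linalg.norm(memory - reconstruction) / np.linalg.norm(memory)
print(left.shape, right.shape, rel_err)   # factor shapes and a small reconstruction error
```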
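For the CORM entry, a minimal sketch of a recent-query-driven KV-cache eviction heuristic: score each cached key by the attention it receives from a window of recent queries and keep only a fixed budget of the top-scoring pairs. The scoring rule and budget are illustrative assumptions, not CORM's actual criterion.
```python
# Illustrative recent-attention KV eviction (not CORM's exact algorithm).
import numpy as np

def evict_kv(keys, values, recent_queries, budget):
    """Keep only `budget` KV pairs, scored by attention from recent queries."""
    d = keys.shape[-1]
    logits = recent_queries @ keys.T / np.sqrt(d)             # (n_recent, n_keys)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    importance = probs.max(axis=0)        # best attention any recent query paid to each key
    keep = np.sort(np.argsort(importance)[-budget:])          # top-`budget`, original order kept
    return keys[keep], values[keep]

rng = np.random.default_rng(1)
keys, values = rng.standard_normal((128, 16)), rng.standard_normal((128, 16))
recent_queries = rng.standard_normal((8, 16))                 # e.g. a sliding window of queries
small_k, small_v = evict_kv(keys, values, recent_queries, budget=32)
print(small_k.shape, small_v.shape)                           # (32, 16) (32, 16)
```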
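For the MemWalker entry, a schematic of navigating a tree of summary nodes: each internal node stores a summary of its subtree, a chooser picks which child to descend into given the query, and the visited summaries double as a reasoning trace. The Node structure and the word-overlap chooser are placeholders for the LLM prompting the actual method relies on.
```python
# Schematic of navigating a tree of summary nodes (the LLM chooser is stubbed out).
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    summary: str                      # summary of all text under this node
    text: Optional[str] = None        # raw text, present only at leaves
    children: List["Node"] = field(default_factory=list)

def choose_child(query: str, children: List[Node]) -> int:
    # Placeholder: a real system would prompt an LLM with the query and the child summaries.
    overlap = lambda s: len(set(query.lower().split()) & set(s.lower().split()))
    return max(range(len(children)), key=lambda i: overlap(children[i].summary))

def navigate(root: Node, query: str):
    node, trace = root, [root.summary]
    while node.children:                               # descend until a leaf is reached
        node = node.children[choose_child(query, node.children)]
        trace.append(node.summary)                     # visited summaries form a reasoning trace
    return node.text or node.summary, trace

# Toy usage
leaves = [Node("section about eviction policies", text="...details on LRA and LFA..."),
          Node("section about TriviaQA results", text="...evaluation numbers...")]
root = Node("paper overview", children=leaves)
answer, trace = navigate(root, "which eviction policies are used?")
print(answer)
print(" -> ".join(trace))
```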