Attendre: Wait To Attend By Retrieval With Evicted Queries in
Memory-Based Transformers for Long Context Processing
- URL: http://arxiv.org/abs/2401.04881v1
- Date: Wed, 10 Jan 2024 02:20:48 GMT
- Title: Attendre: Wait To Attend By Retrieval With Evicted Queries in
Memory-Based Transformers for Long Context Processing
- Authors: Zi Yang, Nan Hua
- Abstract summary: One effective approach is to use a FIFO memory to store keys and values of an attention sublayer from past chunks to allow subsequent queries to attend.
We propose to use eviction policies, such as LRA and LFA, to reduce the memory size and adapt to various architectures.
We also propose the Attendre layer, a wait-to-attend mechanism by retrieving the key-value memory with evicted queries in the query memory.
- Score: 2.9733429388858714
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As LLMs have become capable of processing more complex types of inputs,
researchers have recently studied how to efficiently and affordably process
possibly arbitrarily long sequences. One effective approach is to use a FIFO
memory to store keys and values of an attention sublayer from past chunks to
allow subsequent queries to attend. However, this approach requires a large
memory and/or takes into consideration the specific LM architecture.
Moreover, due to the causal nature between the key-values in prior context and
the queries at present, this approach cannot be extended to bidirectional
attention such as in an encoder-decoder or PrefixLM decoder-only architecture.
In this paper, we propose to use eviction policies, such as LRA and LFA, to
reduce the memory size and adapt to various architectures, and we also propose
the Attendre layer, a wait-to-attend mechanism by retrieving the key-value
memory (K/V memory) with evicted queries in the query memory (Q memory). As a
first step, we evaluate this method in the context length extension setup using
the TriviaQA reading comprehension task, and show the effectiveness of the
approach.
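The abstract does not spell out the algorithm, so the following is only a minimal sketch of the two ideas it names. It assumes that LRA means "least recently attended" (by analogy with LRU caching) and that a query is held in a FIFO Q memory until it is evicted, at which point it attends over the K/V memory. The class names `AttendedKVMemory` and `WaitToAttend` and the parameters `capacity`, `delay`, and `top_k` are hypothetical and are not taken from the paper.

```python
import math
from collections import deque


class AttendedKVMemory:
    """Fixed-capacity K/V memory with an assumed least-recently-attended policy."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.keys = []
        self.values = []
        self.last_attended = []  # step at which each slot was last a top match
        self.step = 0

    def insert(self, key, value):
        # When full, evict the slot that was attended to longest ago
        # (ties go to the oldest slot).
        if len(self.keys) >= self.capacity:
            victim = min(range(len(self.keys)),
                         key=self.last_attended.__getitem__)
            for buf in (self.keys, self.values, self.last_attended):
                buf.pop(victim)
        self.keys.append(key)
        self.values.append(value)
        self.last_attended.append(self.step)

    def attend(self, query, top_k=1):
        # Dot-product attention over all stored keys (memory must be non-empty).
        self.step += 1
        scores = [sum(q * k for q, k in zip(query, key)) for key in self.keys]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Mark the top_k best-matching slots as recently attended.
        ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
        for i in ranked[:top_k]:
            self.last_attended[i] = self.step
        dim = len(self.values[0])
        return [sum(w * v[d] for w, v in zip(weights, self.values))
                for d in range(dim)]


class WaitToAttend:
    """Hypothetical wait-to-attend wrapper: queries wait in a FIFO Q memory."""

    def __init__(self, kv_memory, delay):
        self.kv = kv_memory
        self.delay = delay
        self.pending = deque()  # Q memory

    def push(self, query, key, value):
        # Store the new key/value first, then queue the query.
        self.kv.insert(key, value)
        self.pending.append(query)
        # Once the Q memory overflows, the evicted (oldest) query attends over
        # a K/V memory that already holds entries from later positions.
        if len(self.pending) > self.delay:
            return self.kv.attend(self.pending.popleft())
        return None  # still waiting
```

With `delay=1`, the first call to `push` returns `None`, and the second call returns the output for the first query, computed after the second position's key and value were stored. This is one way to read how waiting lets a query see context on both sides, which a plain FIFO K/V memory cannot provide.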
Related papers
- MeMSVD: Long-Range Temporal Structure Capturing Using Incremental SVD [27.472705540825316]
This paper is on long-term video understanding where the goal is to recognise human actions over long temporal windows (up to minutes long).
We propose an alternative to attention-based schemes which is based on a low-rank approximation of the memory obtained using Singular Value Decomposition.
Our scheme has two advantages: (a) it reduces complexity by more than an order of magnitude, and (b) it is amenable to an efficient implementation for the calculation of the memory bases.
arXiv Detail & Related papers (2024-06-11T12:03:57Z)
- User Intent Recognition and Semantic Cache Optimization-Based Query Processing Framework using CFLIS and MGR-LAU [0.0]
This work analyzed the informational, navigational, and transactional-based intents in queries for enhanced QP.
For efficient QP, the data is structured using Epanechnikov Kernel-Ordering Points To Identify the Clustering Structure (EK-OPTICS).
The extracted features, detected intents and structured data are input to the Multi-head Gated Recurrent Learnable Attention Unit (MGR-LAU).
arXiv Detail & Related papers (2024-06-06T20:28:05Z)
- When to Retrieve: Teaching LLMs to Utilize Information Retrieval Effectively [3.705145020383824]
We show how Large Language Models (LLMs) can learn to use an off-the-shelf information retrieval (IR) system specifically when additional context is required to answer a given question.
arXiv Detail & Related papers (2024-04-30T16:52:55Z)
- CORM: Cache Optimization with Recent Message for Large Language Model Inference [57.109354287786154]
We introduce an innovative method for optimizing the KV cache, which considerably minimizes its memory footprint.
CORM, a KV cache eviction policy, dynamically retains essential key-value pairs for inference without the need for model fine-tuning.
Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70% with negligible performance degradation across six tasks in LongBench.
arXiv Detail & Related papers (2024-04-24T16:11:54Z)
- Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs [61.40047491337793]
We present Hierarchical cOntext MERging (HOMER), a new training-free scheme designed to overcome the limitations of large language models.
HOMER uses a divide-and-conquer algorithm, dividing long inputs into manageable chunks.
A token reduction technique precedes each merging, ensuring memory usage efficiency.
arXiv Detail & Related papers (2024-04-16T06:34:08Z)
- Anchor-based Large Language Models [33.86392289481657]
This study introduces Anchor-based LLMs (AnLLMs), which utilize an anchor-based self-attention network (AnSAN) and also an anchor-based inference strategy.
AnLLMs maintain similar accuracy levels while achieving up to 99% keys/values cache reduction and up to 3.5 times faster inference.
arXiv Detail & Related papers (2024-02-12T12:48:02Z)
- Walking Down the Memory Maze: Beyond Context Limit through Interactive Reading [63.93888816206071]
We introduce MemWalker, a method that processes the long context into a tree of summary nodes. Upon receiving a query, the model navigates this tree in search of relevant information, and responds once it gathers sufficient information.
We show that, beyond effective reading, MemWalker enhances explainability by highlighting the reasoning steps as it interactively reads the text, pinpointing the relevant text segments related to the query.
arXiv Detail & Related papers (2023-10-08T06:18:14Z)
- Temporal-aware Hierarchical Mask Classification for Video Semantic Segmentation [62.275143240798236]
Video semantic segmentation datasets have limited categories per video.
Less than 10% of queries could be matched to receive meaningful gradient updates during VSS training.
Our method achieves state-of-the-art performance on the latest challenging VSS benchmark VSPW without bells and whistles.
arXiv Detail & Related papers (2023-09-14T20:31:06Z)
- Towards Model-Size Agnostic, Compute-Free, Memorization-based Inference of Deep Learning [5.41530201129053]
This paper proposes a novel memorization-based inference (MBI) that is compute free and only requires lookups.
Specifically, our work capitalizes on the inference mechanism of the recurrent attention model (RAM).
By leveraging the low-dimensionality of glimpse, our inference procedure stores key-value pairs comprising glimpse location, patch vector, etc. in a table.
The computations are obviated during inference by utilizing the table to read out key-value pairs and performing compute-free inference by memorization.
arXiv Detail & Related papers (2023-07-14T21:01:59Z)
- Enhancing Large Language Model with Self-Controlled Memory Framework [56.38025154501917]
Large Language Models (LLMs) are constrained by their inability to process lengthy inputs, resulting in the loss of critical historical information.
We propose the Self-Controlled Memory (SCM) framework to enhance the ability of LLMs to maintain long-term memory and recall relevant information.
arXiv Detail & Related papers (2023-04-26T07:25:31Z)
- Sequential Recommender via Time-aware Attentive Memory Network [67.26862011527986]
We propose a temporal gating methodology to improve attention mechanism and recurrent units.
We also propose a Multi-hop Time-aware Attentive Memory network to integrate long-term and short-term preferences.
Our approach is scalable for candidate retrieval tasks and can be viewed as a non-linear generalization of latent factorization for dot-product based Top-K recommendation.
arXiv Detail & Related papers (2020-05-18T11:29:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.