QUITO: Accelerating Long-Context Reasoning through Query-Guided Context Compression
- URL: http://arxiv.org/abs/2408.00274v1
- Date: Thu, 1 Aug 2024 04:28:38 GMT
- Title: QUITO: Accelerating Long-Context Reasoning through Query-Guided Context Compression
- Authors: Wenshan Wang, Yihang Wang, Yixing Fan, Huaming Liao, Jiafeng Guo
- Abstract summary: In this paper, we introduce a novel Query-gUIded aTtention cOmpression (QUITO) method to filter useless information.
Specifically, we take a trigger token to calculate the attention distribution of the context in response to the question.
We evaluate QUITO on two widely used datasets, NaturalQuestions and ASQA.
- Score: 37.08536175557748
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In-context learning (ICL) capabilities are foundational to the success of large language models (LLMs). Recently, context compression has attracted growing interest since it can largely reduce reasoning complexities and computation costs of LLMs. In this paper, we introduce a novel Query-gUIded aTtention cOmpression (QUITO) method, which leverages attention of the question over the contexts to filter useless information. Specifically, we take a trigger token to calculate the attention distribution of the context in response to the question. Based on the distribution, we propose three different filtering methods to satisfy the budget constraints of the context length. We evaluate QUITO on two widely used datasets, NaturalQuestions and ASQA. Experimental results demonstrate that QUITO significantly outperforms established baselines across various datasets and downstream LLMs, underscoring its effectiveness. Our code is available at https://github.com/Wenshansilvia/attention_compressor.
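As a rough illustration of the idea in the abstract, the sketch below keeps the context tokens that receive the most attention from a trigger token, subject to a token budget. The function name `filter_by_attention`, the toy attention weights, and the plain top-k selection rule are assumptions made for illustration. The abstract does not spell out the three filtering methods, and this is not the authors' implementation.

```python
from typing import List, Sequence


def filter_by_attention(
    tokens: Sequence[str],
    attention: Sequence[float],
    budget: int,
) -> List[str]:
    """Keep the `budget` tokens with the highest attention, in original order."""
    if len(tokens) != len(attention):
        raise ValueError("tokens and attention must have the same length")
    if budget <= 0:
        return []
    # Rank token positions by attention weight, highest first.
    ranked = sorted(range(len(tokens)), key=lambda i: attention[i], reverse=True)
    # Keep the top positions, then restore document order so the text stays readable.
    kept = sorted(ranked[:budget])
    return [tokens[i] for i in kept]


# Toy example: hypothetical attention from a trigger token over six context tokens.
context = ["Quito", "is", "the", "capital", "of", "Ecuador"]
weights = [0.30, 0.05, 0.02, 0.25, 0.03, 0.35]
compressed = filter_by_attention(context, weights, budget=3)
print(compressed)
```

In a real setting the weights would come from a language model's attention over the context, and the budget would be set by the target context length.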
Related papers
- Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks? [36.83397306207386]
We evaluate the capabilities of 17 leading Large Language Models (LLMs).
Strikingly, many models are remarkably threadsafe: capable of simultaneously following multiple threads without significant loss in performance.
We find the effective context limit is significantly shorter than the supported context length, with accuracy decreasing as the context window grows.
arXiv Detail & Related papers (2024-11-07T18:59:27Z)
- What is Wrong with Perplexity for Long-context Language Modeling? [71.34933096461124]
Long-context inputs are crucial for large language models (LLMs) in tasks such as extended conversations, document summarization, and many-shot in-context learning.
Perplexity (PPL) has proven unreliable for assessing long-context capabilities.
We propose LongPPL, a novel metric that focuses on key tokens by employing a long-short context contrastive method to identify them.
arXiv Detail & Related papers (2024-10-31T09:39:28Z)
- Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction [47.38471103190534]
Large Language Models (LLMs) have demonstrated remarkable capabilities in handling long context inputs, but this comes at the cost of increased computational resources and latency.
Our research introduces a novel approach for the long context bottleneck to accelerate LLM inference and reduce GPU memory consumption.
We propose an algorithm that uses early layers of an LLM as filters to select and compress input tokens, significantly reducing the context length for subsequent processing.
arXiv Detail & Related papers (2024-09-25T23:14:47Z)
- DetectiveQA: Evaluating Long-Context Reasoning on Detective Novels [89.51834016940153]
We introduce DetectiveQA, a narrative reasoning benchmark with an average context length of over 100K tokens.
We use detective novels as data sources, which naturally have various reasoning elements.
We manually annotated 600 questions in Chinese and then also provided an English edition of the context information and questions.
arXiv Detail & Related papers (2024-09-04T06:28:22Z)
- Prompt Compression with Context-Aware Sentence Encoding for Fast and Improved LLM Inference [16.830389144259584]
We propose context-aware prompt compression (CPC), a sentence-level prompt compression technique.
The key innovation is a novel context-aware sentence encoder that provides a relevance score for each sentence for a given question.
Our method considerably outperforms prior works on prompt compression on benchmark datasets.
arXiv Detail & Related papers (2024-09-02T13:02:51Z)
- QUITO-X: An Information Bottleneck-based Compression Algorithm with Cross-Attention [37.25151458038128]
We introduce information bottleneck theory to examine the properties required by the metric.
Inspired by this, we use cross-attention in encoder-decoder architecture as a new metric.
Our simple method leads to significantly better performance in smaller models with lower latency.
arXiv Detail & Related papers (2024-08-20T02:44:45Z)
- In-Context Former: Lightning-fast Compressing Context for Large Language Model [48.831304302467004]
In this paper, we propose a new approach to compress the long input contexts of Transformer-based large language models (LLMs).
We use the cross-attention mechanism and a small number of learnable digest tokens to condense information from the contextual word embeddings.
Experimental results indicate that our method requires only 1/32 of the floating-point operations of the baseline during compression and improves processing speed by 68 to 112 times.
arXiv Detail & Related papers (2024-06-19T15:14:55Z)
- LLoCO: Learning Long Contexts Offline [63.3458260335454]
We propose LLoCO, a novel approach to processing long contexts.
LLoCO learns contexts offline through context compression and in-domain parameter-efficient finetuning with LoRA.
Our approach extends the effective context window of a 4k token LLaMA2-7B model to handle up to 128k tokens.
arXiv Detail & Related papers (2024-04-11T17:57:22Z)
- Allies: Prompting Large Language Model with Beam Search [107.38790111856761]
In this work, we propose a novel method called ALLIES.
Given an input query, ALLIES leverages LLMs to iteratively generate new queries related to the original query.
By iteratively refining and expanding the scope of the original query, ALLIES captures and utilizes hidden knowledge that may not be directly obtainable through retrieval.
arXiv Detail & Related papers (2023-05-24T06:16:44Z)
- Unlocking Context Constraints of LLMs: Enhancing Context Efficiency of LLMs with Self-Information-Based Content Filtering [4.1372815372396525]
This paper proposes a method called Selective Context that employs self-information to filter out less informative content.
We demonstrate the effectiveness of our approach on tasks of summarisation and question answering across different data sources.
arXiv Detail & Related papers (2023-04-24T13:55:47Z)
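The self-information filtering mentioned in the last entry can be sketched as follows. Self-information is the standard quantity I(x) = -log2 P(x), so rarer tokens score higher. The function names, the toy probability table, and the fixed `keep` count are hypothetical choices for illustration; a real system would obtain probabilities from a language model, and this is not the Selective Context implementation.

```python
import math
from typing import Dict, List, Sequence


def self_information(token: str, probs: Dict[str, float]) -> float:
    """Self-information in bits: I(x) = -log2 P(x)."""
    return -math.log2(probs[token])


def drop_low_information(
    tokens: Sequence[str],
    probs: Dict[str, float],
    keep: int,
) -> List[str]:
    """Keep the `keep` most informative tokens, preserving their original order."""
    scores = [self_information(t, probs) for t in tokens]
    # Rank positions by self-information, most informative first.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:max(keep, 0)])
    return [tokens[i] for i in kept]


# Toy probability table (an assumption; it stands in for model-estimated probabilities).
toy_probs = {"the": 0.5, "model": 0.125, "compresses": 0.0625, "prompt": 0.125}
sentence = ["the", "model", "compresses", "the", "prompt"]
filtered = drop_low_information(sentence, toy_probs, keep=3)
print(filtered)
```

Frequent tokens such as "the" carry little self-information under this table, so they are the first to be dropped.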
This list is automatically generated from the titles and abstracts of the papers on this site.