Context Embeddings for Efficient Answer Generation in RAG
- URL: http://arxiv.org/abs/2407.09252v3
- Date: Tue, 29 Oct 2024 17:34:54 GMT
- Title: Context Embeddings for Efficient Answer Generation in RAG
- Authors: David Rau, Shuai Wang, Hervé Déjean, Stéphane Clinchant
- Abstract summary: We present COCOM, an effective context compression method, reducing long contexts to only a handful of Context Embeddings.
Our method demonstrates a speed-up of up to 5.69 $\times$ while achieving higher performance compared to existing efficient context compression methods.
- Score: 10.702520553261756
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Retrieval-Augmented Generation (RAG) allows overcoming the limited knowledge of LLMs by extending the input with external information. As a consequence, the contextual inputs to the model become much longer, which slows down decoding and directly translates to the time a user has to wait for an answer. We address this challenge by presenting COCOM, an effective context compression method that reduces long contexts to only a handful of Context Embeddings, speeding up generation time by a large margin. Our method allows for different compression rates, trading off decoding time for answer quality. Compared to earlier methods, COCOM allows for handling multiple contexts more effectively, significantly reducing decoding time for long inputs. Our method demonstrates a speed-up of up to 5.69 $\times$ while achieving higher performance compared to existing efficient context compression methods.
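To make the interface concrete, here is a minimal sketch (in PyTorch, not the authors' implementation) of the general idea: a compressor maps the token embeddings of a long retrieved context to a small, fixed number of context embeddings, and the decoder then conditions on those few vectors plus the question instead of on thousands of context tokens. All module and parameter names (ContextCompressor, num_ctx_embeddings, the cross-attention design) are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): compress a long retrieved context
# into a fixed, small number of "context embeddings" that stand in for the
# raw context tokens at generation time. Names and sizes are illustrative.
import torch
import torch.nn as nn


class ContextCompressor(nn.Module):
    """Compress a sequence of context token embeddings into k context embeddings."""

    def __init__(self, d_model: int = 768, num_ctx_embeddings: int = 16, n_heads: int = 8):
        super().__init__()
        # Learnable query vectors; one per output context embedding.
        self.queries = nn.Parameter(torch.randn(num_ctx_embeddings, d_model) * 0.02)
        # Cross-attention pulls information from the context tokens into the queries.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, context_embeds: torch.Tensor) -> torch.Tensor:
        # context_embeds: (batch, context_len, d_model)
        batch = context_embeds.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        compressed, _ = self.cross_attn(q, context_embeds, context_embeds)
        return self.proj(compressed)            # (batch, num_ctx_embeddings, d_model)


if __name__ == "__main__":
    d_model = 768
    compressor = ContextCompressor(d_model=d_model, num_ctx_embeddings=16)
    # Pretend these are token embeddings of a 2,048-token retrieved passage.
    context = torch.randn(1, 2048, d_model)
    question = torch.randn(1, 32, d_model)       # embedded question tokens
    ctx_embeds = compressor(context)             # (1, 16, d_model)
    # The decoder LLM now attends over 16 + 32 positions instead of 2048 + 32.
    decoder_input = torch.cat([ctx_embeds, question], dim=1)
    print(decoder_input.shape)                   # torch.Size([1, 48, 768])
```

In this toy setup the compression rate is the ratio of context tokens to context embeddings (2048/16 = 128); COCOM exposes this rate as the knob that trades decoding time against answer quality.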
Related papers
- LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding [65.46303012350207]
LongVU is an adaptive compression mechanism that reduces the number of video tokens while preserving visual details of long videos.
We leverage DINOv2 features to remove redundant frames that exhibit high similarity.
We perform spatial token reduction across frames based on their temporal dependencies.
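The frame-removal step can be illustrated with a short sketch, assuming per-frame feature vectors have already been extracted (e.g., by a DINOv2 backbone); the greedy comparison against the last kept frame and the 0.9 threshold are illustrative choices, not LongVU's exact procedure.

```python
# Illustrative sketch of similarity-based frame removal (not LongVU's code).
# Assumes `frame_features` holds one precomputed feature vector per frame,
# e.g. from a DINOv2 backbone; the 0.9 threshold is an arbitrary choice here.
import torch
import torch.nn.functional as F


def deduplicate_frames(frame_features: torch.Tensor, threshold: float = 0.9) -> list[int]:
    """Greedily keep a frame only if it is dissimilar to the last kept frame."""
    kept = [0]  # always keep the first frame
    for i in range(1, frame_features.size(0)):
        sim = F.cosine_similarity(frame_features[i], frame_features[kept[-1]], dim=0)
        if sim < threshold:  # sufficiently different -> keep it
            kept.append(i)
    return kept


if __name__ == "__main__":
    feats = torch.randn(300, 384)  # 300 frames, 384-dim features (ViT-S sized)
    print(len(deduplicate_frames(feats)), "of 300 frames kept")
```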
arXiv Detail & Related papers (2024-10-22T21:21:37Z)
- Prompt Compression with Context-Aware Sentence Encoding for Fast and Improved LLM Inference [16.830389144259584]
We propose context-aware prompt compression (CPC), a sentence-level prompt compression technique.
Its key innovation is a context-aware sentence encoder that provides a relevance score for each sentence with respect to a given question.
Our method considerably outperforms prior works on prompt compression on benchmark datasets.
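A rough sketch of the sentence-selection step follows. CPC trains a dedicated context-aware sentence encoder; the off-the-shelf sentence-embedding model below is only a stand-in to illustrate scoring and top-k selection, and the model name and keep_ratio are illustrative choices.

```python
# Rough illustration of sentence-level prompt pruning by question relevance.
# CPC trains its own context-aware sentence encoder; an off-the-shelf
# sentence-embedding model stands in for it here to show the selection step.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")


def compress_prompt(question: str, sentences: list[str], keep_ratio: float = 0.5) -> str:
    """Keep the highest-scoring sentences, preserving their original order."""
    q_emb = model.encode(question, convert_to_tensor=True)
    s_emb = model.encode(sentences, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, s_emb)[0]                 # relevance per sentence
    k = max(1, int(len(sentences) * keep_ratio))
    top_idx = sorted(scores.topk(k).indices.tolist())      # restore original order
    return " ".join(sentences[i] for i in top_idx)
```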
arXiv Detail & Related papers (2024-09-02T13:02:51Z)
- Enhancing and Accelerating Large Language Models via Instruction-Aware Contextual Compression [7.673616185468932]
Supplying irrelevant context to large language models can result in poorer responses, increased inference latency, and higher costs.
This paper introduces a method called Instruction-Aware Contextual Compression, which filters out less informative content.
The experimental results demonstrate that Instruction-Aware Contextual Compression notably reduces memory consumption and minimizes generation latency.
arXiv Detail & Related papers (2024-08-28T02:31:15Z)
- UIO-LLMs: Unbiased Incremental Optimization for Long-Context LLMs [111.12010207132204]
UIO-LLMs is an incremental optimization approach for memory-enhanced transformers under long-context settings.
We refine the training process using the Truncated Backpropagation Through Time (TBPTT) algorithm.
UIO-LLMs successfully handle long contexts, for example extending the context window of Llama2-7b-chat from 4K to 100K tokens with only 2% additional parameters.
arXiv Detail & Related papers (2024-06-26T08:44:36Z)
- In-Context Former: Lightning-fast Compressing Context for Large Language Model [48.831304302467004]
In this paper, we propose a new approach to compress the long input contexts of Transformer-based large language models (LLMs).
We use the cross-attention mechanism and a small number of learnable digest tokens to condense information from the contextual word embeddings.
Experimental results indicate that our method requires only 1/32 of the floating-point operations of the baseline during compression and improves processing speed by 68 to 112 times.
arXiv Detail & Related papers (2024-06-19T15:14:55Z)
- Recurrent Context Compression: Efficiently Expanding the Context Window of LLM [22.595457889113668]
This work introduces a method called Recurrent Context Compression (RCC), designed to efficiently expand the context window length of Transformer-based large language models (LLMs).
We validated our approach on multiple tasks, achieving a compression rate of up to 32x on text reconstruction tasks with a BLEU4 score close to 0.95, and nearly 100% accuracy on a passkey retrieval task with a sequence length of 1M.
arXiv Detail & Related papers (2024-06-10T08:50:59Z)
- Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs [61.40047491337793]
We present Hierarchical cOntext MERging (HOMER), a new training-free scheme designed to overcome the context-length limitations of large language models.
HOMER uses a divide-and-conquer algorithm, dividing long inputs into manageable chunks.
A token reduction technique precedes each merging, ensuring memory usage efficiency.
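A schematic sketch of the divide, reduce, and merge loop is shown below. Note that HOMER applies reduction and merging to hidden states inside the transformer's attention layers; the norm-based token scoring here is only a simple stand-in used to show the hierarchical structure.

```python
# Schematic of hierarchical divide / reduce / merge on token embeddings
# (illustration only: HOMER performs reduction and merging inside the
# transformer's attention layers, not on raw embeddings as done here).
import torch


def reduce_tokens(chunk: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Stand-in token reduction: keep the tokens with the largest L2 norm."""
    k = max(1, int(chunk.size(0) * keep_ratio))
    idx = chunk.norm(dim=-1).topk(k).indices.sort().values  # keep original order
    return chunk[idx]


def hierarchical_merge(tokens: torch.Tensor, chunk_len: int = 512) -> torch.Tensor:
    """Split into chunks, then repeatedly reduce and merge adjacent chunks."""
    chunks = list(tokens.split(chunk_len))
    while len(chunks) > 1:
        chunks = [reduce_tokens(c) for c in chunks]
        # Merge adjacent pairs (the last chunk is carried over if the count is odd).
        chunks = [torch.cat(chunks[i:i + 2]) for i in range(0, len(chunks), 2)]
    return reduce_tokens(chunks[0])


if __name__ == "__main__":
    long_input = torch.randn(8192, 768)          # 8,192 token embeddings
    print(hierarchical_merge(long_input).shape)  # far fewer tokens remain
```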
arXiv Detail & Related papers (2024-04-16T06:34:08Z)
- LLoCO: Learning Long Contexts Offline [63.3458260335454]
We propose LLoCO, a novel approach to processing long contexts.
LLoCO learns contexts offline through context compression and in-domain parameter-efficient finetuning with LoRA.
Our approach extends the effective context window of a 4k token LLaMA2-7B model to handle up to 128k tokens.
arXiv Detail & Related papers (2024-04-11T17:57:22Z)
- Extending Context Window of Large Language Models via Semantic Compression [21.35020344956721]
Large Language Models (LLMs) often impose limitations on the length of the text input to ensure the generation of fluent and relevant responses.
We propose a novel semantic compression method that enables generalization to texts 6-8 times longer, without incurring significant computational costs or requiring fine-tuning.
arXiv Detail & Related papers (2023-12-15T07:04:33Z)
- Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster [61.83949316226113]
FastCoT is a model-agnostic framework based on parallel decoding.
We show that FastCoT saves inference time by nearly 20% with only a negligible performance drop compared to the regular approach.
arXiv Detail & Related papers (2023-11-14T15:56:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.