Context Embeddings for Efficient Answer Generation in RAG
- URL: http://arxiv.org/abs/2407.09252v2
- Date: Tue, 23 Jul 2024 12:28:31 GMT
- Title: Context Embeddings for Efficient Answer Generation in RAG
- Authors: David Rau, Shuai Wang, Hervé Déjean, Stéphane Clinchant
- Abstract summary: We present COCOM, an effective context compression method, reducing long contexts to only a handful of Context Embeddings.
Our method demonstrates a speed-up of up to 5.69$\times$ while achieving higher performance compared to existing efficient context compression methods.
- Score: 10.702520553261756
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Retrieval-Augmented Generation (RAG) overcomes the limited knowledge of LLMs by extending the input with external information. As a consequence, the contextual inputs to the model become much longer, which slows down decoding and directly increases the time a user has to wait for an answer. We address this challenge by presenting COCOM, an effective context compression method that reduces long contexts to only a handful of Context Embeddings, speeding up generation by a large margin. Our method allows for different compression rates, trading off decoding time for answer quality. Compared to earlier methods, COCOM handles multiple contexts more effectively, significantly reducing decoding time for long inputs. Our method demonstrates a speed-up of up to 5.69$\times$ while achieving higher performance compared to existing efficient context compression methods.
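The trade-off between compression rate and decoder input length described in the abstract can be illustrated with a little arithmetic. The sketch below is not taken from the paper: it assumes that a compression rate `rate` maps a context of `n` tokens to `ceil(n / rate)` Context Embeddings, and the function names and the example sizes (five contexts of 128 tokens, a 16-token question) are hypothetical.

```python
import math


def num_context_embeddings(num_tokens: int, rate: int) -> int:
    """Embeddings needed for one context, ASSUMING ceil(num_tokens / rate)."""
    if num_tokens < 0 or rate < 1:
        raise ValueError("num_tokens must be >= 0 and rate must be >= 1")
    return math.ceil(num_tokens / rate)


def decoder_input_length(context_lengths, question_tokens: int, rate: int) -> int:
    """Total positions the decoder sees: question tokens plus compressed contexts."""
    compressed = sum(num_context_embeddings(n, rate) for n in context_lengths)
    return question_tokens + compressed


# Hypothetical setting: five retrieved contexts of 128 tokens, 16-token question.
contexts = [128] * 5
for rate in (1, 4, 16, 128):
    print(rate, decoder_input_length(contexts, question_tokens=16, rate=rate))
```

Under these assumptions, a higher rate shrinks the decoder input roughly in proportion, which is where the decoding-time savings would come from; the effect on answer quality is an empirical question that this arithmetic does not capture.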
Related papers
- UIO-LLMs: Unbiased Incremental Optimization for Long-Context LLMs [111.12010207132204]
UIO-LLMs is an incremental optimization approach for memory-enhanced transformers under long-context settings.
We refine the training process using the Truncated Backpropagation Through Time (TBPTT) algorithm.
UIO-LLMs successfully handle long context, such as extending the context window of Llama2-7b-chat from 4K to 100K tokens with minimal 2% additional parameters.
arXiv Detail & Related papers (2024-06-26T08:44:36Z)
- In-Context Former: Lightning-fast Compressing Context for Large Language Model [48.831304302467004]
In this paper, we propose a new approach to compress the long input contexts of Transformer-based large language models (LLMs).
We use the cross-attention mechanism and a small number of learnable digest tokens to condense information from the contextual word embeddings.
Experimental results indicate that our method requires only 1/32 of the floating-point operations of the baseline during compression and improves processing speed by 68 to 112 times.
arXiv Detail & Related papers (2024-06-19T15:14:55Z)
- Recurrent Context Compression: Efficiently Expanding the Context Window of LLM [22.595457889113668]
This work introduces a method called Recurrent Context Compression (RCC), designed to efficiently expand the context window length of Transformer-based large language models (LLMs).
We validated our approach on multiple tasks, achieving a compression rate of up to 32x on text reconstruction tasks with a BLEU4 score close to 0.95, and nearly 100% accuracy on a passkey retrieval task with a sequence length of 1M.
arXiv Detail & Related papers (2024-06-10T08:50:59Z)
- Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs [61.40047491337793]
We present Hierarchical cOntext MERging (HOMER), a new training-free scheme designed to overcome the limitations of large language models.
HOMER uses a divide-and-conquer algorithm, dividing long inputs into manageable chunks.
A token reduction technique precedes each merging, ensuring memory usage efficiency.
arXiv Detail & Related papers (2024-04-16T06:34:08Z)
- LLoCO: Learning Long Contexts Offline [63.3458260335454]
We introduce LLoCO, a technique that combines context compression, retrieval, and parameter-efficient finetuning using LoRA.
We evaluate our approach on several long-context question-answering datasets, demonstrating that LLoCO significantly outperforms in-context learning.
arXiv Detail & Related papers (2024-04-11T17:57:22Z)
- Training LLMs over Neurally Compressed Text [55.11828645767342]
This paper explores the idea of training large language models (LLMs) over highly compressed text.
We propose Equal-Info Windows, a novel compression technique whereby text is segmented into blocks that each compress to the same bit length.
We demonstrate effective learning over neurally compressed text that improves with scale, and outperforms byte-level baselines by a wide margin on perplexity and inference speed benchmarks.
arXiv Detail & Related papers (2024-04-04T17:48:28Z)
- Extending Context Window of Large Language Models via Semantic Compression [21.35020344956721]
Large Language Models (LLMs) often impose limitations on the length of the text input to ensure the generation of fluent and relevant responses.
We propose a novel semantic compression method that enables generalization to texts 6-8 times longer, without incurring significant computational costs or requiring fine-tuning.
arXiv Detail & Related papers (2023-12-15T07:04:33Z)
- Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster [61.83949316226113]
FastCoT is a model-agnostic framework based on parallel decoding.
We show that FastCoT saves inference time by nearly 20% with only a negligible performance drop compared to the regular approach.
arXiv Detail & Related papers (2023-11-14T15:56:18Z)
- Compressing Context to Enhance Inference Efficiency of Large Language Models [26.75216730927996]
This paper proposes a method called Selective Context to enhance the inference efficiency of large language models (LLMs).
We test our approach using common data sources requiring long context processing: arXiv papers, news articles, and long conversations.
Experimental results show that Selective Context significantly reduces memory cost and decreases generation latency.
arXiv Detail & Related papers (2023-10-09T23:03:24Z)
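The In-Context Former entry in the list above describes condensing contextual word embeddings through cross-attention with a small number of learnable digest tokens. The sketch below illustrates only that general idea and is not the paper's implementation: it uses single-head scaled dot-product attention with no learned projections, fixed (untrained) digest vectors, and hypothetical names (`condense`, `digest_tokens`, `context_embeddings`).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def condense(digest_tokens, context_embeddings):
    """Cross-attention: each digest token (query) attends over all context
    embeddings (used as both keys and values) and returns one condensed vector.
    Simplification: identity projections, single head."""
    dim = len(context_embeddings[0])
    scale = math.sqrt(dim)
    condensed = []
    for query in digest_tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale
            for key in context_embeddings
        ]
        weights = softmax(scores)
        condensed.append([
            sum(w * vec[i] for w, vec in zip(weights, context_embeddings))
            for i in range(dim)
        ])
    return condensed


# Four toy context embeddings condensed into two digest vectors.
context = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
digest = [[1.0, 0.0], [0.0, 1.0]]
summary = condense(digest, context)
print(len(context), "->", len(summary), "vectors")
```

The output always has as many vectors as there are digest tokens, regardless of context length, which is the property that makes this style of compression attractive. In a trained model the digest tokens and projection matrices would be learned rather than fixed as they are here.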
This list is automatically generated from the titles and abstracts of the papers on this site.