APE: Faster and Longer Context-Augmented Generation via Adaptive Parallel Encoding
- URL: http://arxiv.org/abs/2502.05431v2
- Date: Wed, 12 Feb 2025 13:54:01 GMT
- Title: APE: Faster and Longer Context-Augmented Generation via Adaptive Parallel Encoding
- Authors: Xinyu Yang, Tianqi Chen, Beidi Chen
- Abstract summary: We show how parallel encoding can be used to solve context-augmented generation (CAG) problems.
APE preserves 98% and 93% of sequential encoding performance on RAG and ICL tasks, respectively, using the same inputs.
It also scales to many-shot CAG, effectively encoding hundreds of contexts in parallel.
- Score: 21.428355295838845
- License:
- Abstract: Context-augmented generation (CAG) techniques, including RAG and ICL, require the efficient combination of multiple contexts to generate responses to user queries. Directly inputting these contexts as a sequence introduces a considerable computational burden by re-encoding the combined selection of contexts for every request. To address this, we explore the promising potential of parallel encoding to independently pre-compute and cache each context's KV states. This approach enables the direct loading of cached states during inference while accommodating more contexts through position reuse across contexts. However, due to misalignments in attention distribution, directly applying parallel encoding results in a significant performance drop. To enable effective and efficient CAG, we propose Adaptive Parallel Encoding ($\textbf{APE}$), which introduces a shared prefix, an attention temperature, and a scaling factor to align the distribution of parallel encoding with that of sequential encoding. Results on RAG and ICL tasks demonstrate that APE preserves 98% and 93% of sequential encoding performance using the same inputs while outperforming parallel encoding by 3.6% and 7.9%, respectively. It also scales to many-shot CAG, effectively encoding hundreds of contexts in parallel. Efficiency evaluation shows that APE achieves an end-to-end 4.5$\times$ speedup by reducing prefilling time by 28$\times$ for a 128K-length context.
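To make the parallel-encoding idea above concrete, here is a minimal, hypothetical sketch (not the authors' implementation): each context's KV states are pre-computed independently and cached, and at query time the cached states are concatenated while an attention temperature and a scaling factor re-weight the softmax, in the spirit of the alignment knobs the abstract names. The encoder stub, toy dimensions, and the temperature/scale values are illustrative assumptions; the shared prefix is omitted for brevity.

```python
# Minimal sketch of parallel encoding with APE-style re-weighting.
# Assumptions: a placeholder per-context "encoder", toy head dimension,
# and illustrative temperature/scale values (not from the paper).
import torch
import torch.nn.functional as F

d = 64  # head dimension (toy value)

def encode_context(tokens: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Stand-in for a transformer pass producing per-token K/V states.

    In a real system the full model would encode each context once,
    and the resulting KV cache would be stored offline for reuse.
    """
    k = torch.randn(tokens.shape[0], d)  # placeholder keys
    v = torch.randn(tokens.shape[0], d)  # placeholder values
    return k, v

def attend_over_parallel_caches(q, caches, temperature=0.9, scale=1.2):
    """Attention of one query vector over independently cached KV states.

    `temperature` sharpens/flattens the softmax and `scale` re-weights the
    context logits relative to a sequential run; both values are illustrative.
    """
    keys = torch.cat([k for k, _ in caches], dim=0)    # (total_tokens, d)
    values = torch.cat([v for _, v in caches], dim=0)  # (total_tokens, d)
    logits = scale * (q @ keys.T) / (d ** 0.5)
    weights = F.softmax(logits / temperature, dim=-1)
    return weights @ values

# Pre-compute caches for several contexts independently (parallel encoding),
# then answer a query by loading them directly instead of re-encoding the
# concatenated sequence.
caches = [encode_context(torch.arange(n)) for n in (128, 256, 64)]
query = torch.randn(d)
out = attend_over_parallel_caches(query, caches)
print(out.shape)  # torch.Size([64])
```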
Related papers
- Efficient Long Context Language Model Retrieval with Compression [57.09163579304332]
Long Context Language Models (LCLMs) have emerged as a new paradigm to perform Information Retrieval (IR).
We propose a new compression approach tailored for LCLM retrieval, which is trained to maximize the retrieval performance while minimizing the length of the compressed passages.
We show that CoLoR improves the retrieval performance by 6% while compressing the in-context size by a factor of 1.91.
arXiv Detail & Related papers (2024-12-24T07:30:55Z) - EXIT: Context-Aware Extractive Compression for Enhancing Retrieval-Augmented Generation [8.757777529568383]
Current RAG systems often struggle when retrieval models fail to rank the most relevant documents.
We introduce EXIT, an extractive context compression framework.
Our evaluations show that EXIT consistently surpasses existing compression methods.
arXiv Detail & Related papers (2024-12-17T05:38:27Z) - Cerberus: Efficient Inference with Adaptive Parallel Decoding and Sequential Knowledge Enhancement [12.40683763019276]
Large language models (LLMs) often face a bottleneck in inference speed due to their reliance on auto-regressive decoding.
We have identified two key issues with existing parallel decoding frameworks.
We propose Cerberus, an adaptive parallel decoding framework.
arXiv Detail & Related papers (2024-10-17T08:55:18Z) - Let the Code LLM Edit Itself When You Edit the Code [50.46536185784169]
We introduce Positional Integrity Encoding (PIE).
Results demonstrate that PIE reduces computational overhead by over 85% compared to the standard full recomputation approach.
arXiv Detail & Related papers (2024-07-03T14:34:03Z) - KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches [52.02764371205856]
Long context capability is a crucial competency for large language models (LLMs).
This work provides a taxonomy of current methods and evaluates 10+ state-of-the-art approaches across seven categories of long context tasks.
arXiv Detail & Related papers (2024-07-01T17:59:47Z) - LoCoCo: Dropping In Convolutions for Long Context Compression [77.26610232994508]
This paper presents a novel approach, Dropping In Convolutions for Long Context Compression (LoCoCo)
LoCoCo employs only a fixed-size Key-Value (KV) cache and can enhance efficiency in both inference and fine-tuning stages.
arXiv Detail & Related papers (2024-06-08T01:35:11Z) - Accelerating Inference of Retrieval-Augmented Generation via Sparse Context Selection [28.15184715270483]
Large language models (LLMs) augmented with retrieval exhibit robust performance and extensive versatility.
We propose a novel paradigm named Sparse RAG, which seeks to cut costs through sparsity.
Sparse RAG encodes retrieved documents in parallel, which eliminates latency introduced by long-range attention of retrieved documents.
arXiv Detail & Related papers (2024-05-25T11:10:04Z) - Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs [61.40047491337793]
We present Hierarchical cOntext MERging (HOMER), a new training-free scheme designed to overcome the context-length limitations of large language models.
HOMER uses a divide-and-conquer algorithm, dividing long inputs into manageable chunks.
A token reduction technique precedes each merging, ensuring memory usage efficiency.
arXiv Detail & Related papers (2024-04-16T06:34:08Z) - Bifurcated Attention: Accelerating Massively Parallel Decoding with Shared Prefixes in LLMs [39.16152482491236]
Bifurcated attention is a method designed to enhance language model inference in shared-context batch decoding scenarios.
Our approach addresses the challenge of redundant memory IO costs, a critical factor contributing to latency in high batch sizes and extended context lengths.
arXiv Detail & Related papers (2024-03-13T16:30:57Z) - Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster [61.83949316226113]
FastCoT is a model-agnostic framework based on parallel decoding.
We show that FastCoT saves inference time by nearly 20% with only a negligible performance drop compared to the regular approach.
arXiv Detail & Related papers (2023-11-14T15:56:18Z)