Block-Attention for Efficient RAG
- URL: http://arxiv.org/abs/2409.15355v4
- Date: Thu, 17 Oct 2024 15:27:30 GMT
- Title: Block-Attention for Efficient RAG
- Authors: East Sun, Yan Wang, Lan Tian
- Abstract summary: Block-Attention addresses the increased inference latency and cost in Retrieval-Augmented Generation (RAG) scenarios.
By defining each passage as a block, Block-Attention enables us to reuse the KV states of passages that have been seen before.
Experiments on four RAG benchmarks demonstrate that after block fine-tuning, the Block-Attention model achieves performance comparable to self-attention models.
- Score: 3.926246435703829
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Block-Attention, an attention mechanism designed to address the increased inference latency and cost in Retrieval-Augmented Generation (RAG) scenarios. Traditional approaches often encode the entire context. Instead, Block-Attention divides retrieved documents into discrete blocks, with each block independently calculating key-value (KV) states except for the final block. In RAG scenarios, by defining each passage as a block, Block-Attention enables us to reuse the KV states of passages that have been seen before, thereby significantly reducing the latency and the computation overhead during inference. The implementation of Block-Attention involves block segmentation, position re-encoding, and fine-tuning the LLM to adapt to the Block-Attention mechanism. Experiments on four RAG benchmarks demonstrate that after block fine-tuning, the Block-Attention model achieves performance comparable to self-attention models (68.4% vs 67.9% on Llama3) or even superior performance (62.8% vs 59.6% on Mistral). Notably, Block-Attention significantly reduces the time to first token (TTFT) and floating point operations (FLOPs) to a very low level. It only takes 45 ms to output the first token for an input sequence with a total length of 32K. Compared to the self-attention models, the time consumption and corresponding FLOPs are reduced by 98.7% and 99.8%, respectively.
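The mechanism described above (independent per-passage KV computation, caching, position re-encoding, and a final block that attends to everything before it) can be illustrated with a toy single-head example. The sketch below is a rough illustration under stated assumptions: the numpy setup, the simplified rotary embedding, the cache keyed by passage id, and the choice to store unrotated keys are all illustrative choices, not the authors' implementation.

```python
"""Toy single-head sketch of block-wise KV reuse with position re-encoding."""
import numpy as np

D = 8  # toy head dimension (must be even so the rotary pairs line up)


def rope(x, positions):
    """Apply a simple rotary position embedding to x of shape (seq_len, D)."""
    half = D // 2
    freqs = 1.0 / (10000 ** (np.arange(half) / half))      # (half,)
    angles = positions[:, None] * freqs[None, :]            # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)


def encode_block(block_emb, Wk, Wv):
    """Compute a block's KV states independently of all other blocks.
    Keys are cached without rotation so the block can later be re-encoded
    for whatever absolute position it ends up occupying (one simple way to
    mimic the position re-encoding step in this toy setting)."""
    return block_emb @ Wk, block_emb @ Wv


def first_token_context(blocks, query_emb, Wq, Wk, Wv, kv_cache):
    """Assemble cached block KVs, re-encode their positions, and let the final
    block (the user query) attend over all preceding blocks plus itself."""
    keys, values, offset = [], [], 0
    for block_id, emb in blocks:
        if block_id not in kv_cache:              # each unseen passage is encoded once
            kv_cache[block_id] = encode_block(emb, Wk, Wv)
        k, v = kv_cache[block_id]
        keys.append(rope(k, np.arange(offset, offset + len(k))))   # position re-encoding
        values.append(v)
        offset += len(k)
    q_pos = np.arange(offset, offset + len(query_emb))
    q = rope(query_emb @ Wq, q_pos)
    keys.append(rope(query_emb @ Wk, q_pos))
    values.append(query_emb @ Wv)
    K, V = np.concatenate(keys), np.concatenate(values)
    # Causal mask: query token i (absolute position offset + i) sees positions <= offset + i.
    mask = np.tril(np.ones((len(q), len(K)), dtype=bool), k=offset)
    scores = np.where(mask, q @ K.T / np.sqrt(D), -np.inf)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ V   # context vectors used to emit the first output token


rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(D, D)) for _ in range(3))
cache = {}
passages = [("doc-17", rng.normal(size=(5, D))), ("doc-42", rng.normal(size=(4, D)))]
query = rng.normal(size=(3, D))
out = first_token_context(passages, query, Wq, Wk, Wv, cache)
print(out.shape, sorted(cache))   # (3, 8) ['doc-17', 'doc-42'] -- cached KVs are now reusable
```

In a real model the cached KV states would exist at every layer of the block-fine-tuned LLM; a single projection per block stands in for that here.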
Related papers
- Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss [59.835032408496545]
We propose a tile-based strategy that partitions the contrastive loss calculation into arbitrary small blocks.
We also introduce a multi-level tiling strategy to leverage the hierarchical structure of distributed systems.
Compared to SOTA memory-efficient solutions, it achieves a two-order-of-magnitude reduction in memory while maintaining comparable speed.
A toy sketch of the tiling idea follows this entry's link.
arXiv Detail & Related papers (2024-10-22T17:59:30Z)
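The tiling idea in this summary can be illustrated with a small, single-process InfoNCE computation that touches the similarity matrix one tile at a time via a streaming log-sum-exp. The tile size, temperature, and numpy setup below are assumptions for illustration; the paper's multi-level, distributed tiling strategy is not reproduced.

```python
"""Single-process illustration of a tiled contrastive (InfoNCE) loss."""
import numpy as np


def tiled_infonce(img, txt, tile=4, temperature=0.07):
    """Image-to-text InfoNCE over B pairs, touching only (tile x tile) blocks
    of the similarity matrix at a time via a streaming log-sum-exp."""
    B = img.shape[0]
    running_max = np.full(B, -np.inf)        # per-row streaming LSE statistics
    running_sum = np.zeros(B)
    pos = np.einsum("bd,bd->b", img, txt) / temperature   # positive (diagonal) logits

    for r in range(0, B, tile):
        for c in range(0, B, tile):
            block = img[r:r + tile] @ txt[c:c + tile].T / temperature
            new_max = np.maximum(running_max[r:r + tile], block.max(axis=1))
            running_sum[r:r + tile] = (
                running_sum[r:r + tile] * np.exp(running_max[r:r + tile] - new_max)
                + np.exp(block - new_max[:, None]).sum(axis=1)
            )
            running_max[r:r + tile] = new_max

    # loss_i = logsumexp_j(sim_ij) - sim_ii, averaged over the batch
    return float(np.mean(running_max + np.log(running_sum) - pos))


rng = np.random.default_rng(0)
img = rng.normal(size=(16, 32))
txt = rng.normal(size=(16, 32))
img /= np.linalg.norm(img, axis=1, keepdims=True)   # cosine similarities keep exp() in range
txt /= np.linalg.norm(txt, axis=1, keepdims=True)
dense = img @ txt.T / 0.07                          # reference: full matrix in memory
reference = np.mean(np.log(np.exp(dense).sum(axis=1)) - np.diag(dense))
print(np.isclose(tiled_infonce(img, txt), reference))   # True
```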
- Realizing Unaligned Block-wise Pruning for DNN Acceleration on Mobile Devices [1.6114012813668932]
Block-wise pruning is promising because it trades only a small accuracy drop for sizeable speedup gains, but conventional schemes constrain blocks to fixed, aligned positions.
Unaligned block pruning (UBP) addresses this by allowing blocks to be selected at arbitrary positions.
We propose a pseudo-optimal yet fast block selection algorithm called Block Expansion and Division.
arXiv Detail & Related papers (2024-07-29T01:59:06Z)
- Improved Block Merging for 3D Point Cloud Instance Segmentation [6.632158868486343]
The proposed work improves over the state-of-the-art by allowing wrongly labelled points of already processed blocks to be corrected through label propagation.
Our experiments show that the proposed block merging algorithm significantly and consistently improves the obtained accuracy for all evaluation metrics employed in the literature.
arXiv Detail & Related papers (2024-07-09T16:06:34Z)
- Towards Universal Dense Blocking for Entity Resolution [49.06313308481536]
We propose UniBlocker, a dense blocker that is pre-trained on a domain-independent, easily obtainable corpus.
By conducting domain-independent pre-training, UniBlocker can be adapted to various downstream blocking scenarios without requiring domain-specific fine-tuning.
Our experiments show that the proposed UniBlocker, without any domain-specific learning, significantly outperforms previous self- and unsupervised dense blocking methods.
arXiv Detail & Related papers (2024-04-23T08:39:29Z)
- Accurate Block Quantization in LLMs with Outliers [0.6138671548064355]
Demand for inference on extremely large-scale LLMs has grown enormously in recent months.
The problem is aggravated by the explosive rise in the lengths of the sequences being processed.
Various quantization techniques have been proposed that allow accurate quantization for both weights and activations.
arXiv Detail & Related papers (2024-03-29T12:15:06Z)
- CBQ: Cross-Block Quantization for Large Language Models [66.82132832702895]
Post-training quantization (PTQ) has played a key role in compressing large language models (LLMs) with ultra-low costs.
We propose CBQ, a cross-block reconstruction-based PTQ method for LLMs.
CBQ employs a cross-block reconstruction scheme that establishes long-range dependencies across multiple blocks to minimize error accumulation.
arXiv Detail & Related papers (2023-12-13T07:56:27Z)
- Constant Memory Attention Block [74.38724530521277]
Constant Memory Attention Block (CMAB) is a novel general-purpose attention block that computes its output in constant memory and performs updates in constant computation.
We show our proposed methods achieve results competitive with state-of-the-art while being significantly more memory efficient.
arXiv Detail & Related papers (2023-06-21T22:41:58Z)
- SC-Block: Supervised Contrastive Blocking within Entity Resolution Pipelines [75.5113002732746]
This paper presents SC-Block, a blocking method that utilizes supervised contrastive learning for positioning records in the embedding space.
We benchmark SC-Block against eight state-of-the-art blocking methods.
For measuring the overall runtime, we determine candidate sets with 99.5% pair completeness and pass them to the matcher.
arXiv Detail & Related papers (2023-03-06T13:49:41Z)
- Self-Supervised Learning of Perceptually Optimized Block Motion Estimates for Video Compression [50.48504867843605]
We propose a search-free block motion estimation framework using a multi-stage convolutional neural network.
We deploy the multi-scale structural similarity (MS-SSIM) loss function to optimize the perceptual quality of the motion compensated predicted frames.
arXiv Detail & Related papers (2021-10-05T03:38:43Z)
- Algorithm to Compilation Co-design: An Integrated View of Neural Network Sparsity [0.8566457170664925]
We apply structured and unstructured pruning to attention weights of transformer blocks of the BERT language model.
We study relationships between modeling decisions and their direct impact on sparsity-enhanced execution.
A generic sketch contrasting the two pruning styles follows this entry's link.
arXiv Detail & Related papers (2021-06-16T15:13:26Z)
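As a purely generic illustration of the two pruning styles this summary names (not the paper's code, and with BERT replaced by a random stand-in matrix), the sketch below contrasts unstructured magnitude pruning with block-structured pruning; the 4x4 block shape and 50% sparsity are arbitrary assumptions.

```python
"""Generic contrast of unstructured vs. block-structured weight pruning."""
import numpy as np


def unstructured_prune(w, sparsity):
    """Zero the smallest-magnitude individual weights (fine-grained sparsity)."""
    k = int(sparsity * w.size)
    thresh = np.partition(np.abs(w), k, axis=None)[k]
    return np.where(np.abs(w) < thresh, 0.0, w)


def block_prune(w, sparsity, block=(4, 4)):
    """Zero whole (bh x bw) tiles with the smallest L2 norm; the coarser
    pattern maps more directly onto hardware-friendly sparse kernels."""
    bh, bw = block
    rows, cols = w.shape
    tiles = w.reshape(rows // bh, bh, cols // bw, bw)
    scores = np.sqrt((tiles ** 2).sum(axis=(1, 3)))     # one score per tile
    k = int(sparsity * scores.size)
    thresh = np.partition(scores, k, axis=None)[k]
    keep = (scores >= thresh)[:, None, :, None]
    return (tiles * keep).reshape(rows, cols)


rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16))     # stand-in for one attention projection matrix
print((unstructured_prune(W, 0.5) == 0).mean())   # 0.5, zeros scattered anywhere
print((block_prune(W, 0.5) == 0).mean())          # 0.5, zeros grouped into 4x4 tiles
```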
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.