Block-Attention for Efficient RAG
- URL: http://arxiv.org/abs/2409.15355v4
- Date: Thu, 17 Oct 2024 15:27:30 GMT
- Title: Block-Attention for Efficient RAG
- Authors: East Sun, Yan Wang, Lan Tian
- Abstract summary: Block-Attention addresses the increased inference latency and cost in Retrieval-Augmented Generation (RAG) scenarios.
By defining each passage as a block, Block-Attention enables us to reuse the KV states of passages that have been seen before.
Experiments on four RAG benchmarks demonstrate that after block fine-tuning, the Block-Attention model achieves performance comparable to self-attention models.
- Score: 3.926246435703829
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Block-Attention, an attention mechanism designed to address the increased inference latency and cost in Retrieval-Augmented Generation (RAG) scenarios. Traditional approaches often encode the entire context. Instead, Block-Attention divides retrieved documents into discrete blocks, with each block independently calculating key-value (KV) states except for the final block. In RAG scenarios, by defining each passage as a block, Block-Attention enables us to reuse the KV states of passages that have been seen before, thereby significantly reducing the latency and the computation overhead during inference. The implementation of Block-Attention involves block segmentation, position re-encoding, and fine-tuning the LLM to adapt to the Block-Attention mechanism. Experiments on four RAG benchmarks demonstrate that after block fine-tuning, the Block-Attention model achieves performance comparable to self-attention models (68.4% vs 67.9% on Llama3) or even superior performance (62.8% vs 59.6% on Mistral). Notably, Block-Attention significantly reduces the time to first token (TTFT) and floating point operations (FLOPs) to a very low level. It only takes 45 ms to output the first token for an input sequence with a total length of 32K. Compared to the self-attention models, the time consumption and corresponding FLOPs are reduced by 98.7% and 99.8%, respectively.
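The mechanism described in the abstract (block segmentation, independent per-block KV computation, position re-encoding when a cached block is reused, and full attention only for the final block) can be illustrated with a toy sketch. This is a minimal, single-layer, single-head illustration under stated assumptions, not the paper's implementation: the random projection layer, the `rope`, `encode_block`, and `block_attention_prefill` names, and the random passage embeddings are all hypothetical, and causal masking within the final block, multiple layers, and attention heads are omitted.

```python
import torch

def rope(x, positions, base=10000.0):
    """Rotary position embedding applied to x (seq, dim) at absolute positions."""
    dim = x.size(-1)
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = positions.float()[:, None] * inv_freq[None, :]   # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Toy single-head attention layer standing in for one LLM layer.
torch.manual_seed(0)
dim = 64
Wq, Wk, Wv = (torch.randn(dim, dim) * 0.02 for _ in range(3))

kv_cache = {}  # passage id -> (K, V) computed once, reusable across requests

def encode_block(block_emb, passage_id):
    """Compute one passage block's KV states independently of every other block
    and cache them unrotated, so positions can be re-encoded at reuse time."""
    k, v = block_emb @ Wk, block_emb @ Wv
    kv_cache[passage_id] = (k, v)
    return k, v

def block_attention_prefill(passages, query_emb):
    """Reuse cached passage KVs, re-encode their positions to this prompt's
    global offsets, and let only the final (query) block attend to everything."""
    keys, values, offset = [], [], 0
    for pid, emb in passages:
        k, v = kv_cache[pid] if pid in kv_cache else encode_block(emb, pid)
        pos = torch.arange(offset, offset + k.size(0))
        keys.append(rope(k, pos))          # position re-encoding at reuse time
        values.append(v)
        offset += k.size(0)
    q_pos = torch.arange(offset, offset + query_emb.size(0))
    q = rope(query_emb @ Wq, q_pos)
    keys.append(rope(query_emb @ Wk, q_pos))
    values.append(query_emb @ Wv)
    K, V = torch.cat(keys), torch.cat(values)
    attn = torch.softmax(q @ K.T / dim ** 0.5, dim=-1)
    return attn @ V

# Two cached 16-token passages plus an 8-token user question (random embeddings).
p1, p2 = torch.randn(16, dim), torch.randn(16, dim)
out = block_attention_prefill([("p1", p1), ("p2", p2)], torch.randn(8, dim))
print(out.shape)  # torch.Size([8, 64])
```

The point of the sketch is the caching pattern: a passage's KV states are computed once with no dependence on neighbouring passages, and only a cheap re-encoding is needed to place them at a new offset, which is what drives the reported TTFT and FLOPs reductions.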
Related papers
- Next Block Prediction: Video Generation via Semi-Autoregressive Modeling [92.60177942930946]
Next-Block Prediction (NBP) is a semi-autoregressive (semi-AR) framework for video generation.
NBP employs bidirectional attention within each block, enabling tokens to capture more robust spatial dependencies.
Our model achieves FVD scores of 103.3 on UCF101 and 25.5 on K600, outperforming the vanilla next-token prediction (NTP) model by an average of 4.4.
arXiv Detail & Related papers (2025-02-11T17:57:53Z)
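The NBP entry above hinges on its attention pattern: bidirectional attention inside each block, causal attention across blocks. A minimal sketch of such a block-causal mask, with an illustrative function name and block size not taken from the paper:

```python
import torch

def block_causal_mask(num_tokens, block_size):
    """True where attention is allowed: a token sees every token in its own
    block (bidirectional) and every token in earlier blocks (causal across blocks)."""
    block_id = torch.arange(num_tokens) // block_size
    return block_id[None, :] <= block_id[:, None]

# 8 tokens in blocks of 4: two fully visible 4x4 diagonal blocks,
# with the second block also attending to the first.
print(block_causal_mask(8, 4).int())
```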
- BCQ: Block Clustered Quantization for 4-bit (W4A4) LLM Inference [8.136601122570347]
Post-training quantization (PTQ) is a promising approach to reducing the storage and computational requirements of large language models (LLMs) without additional training cost.
Recent PTQ studies have primarily focused on quantizing only weights to sub-8-bits while maintaining activations at 8-bits or higher.
arXiv Detail & Related papers (2025-02-07T23:06:03Z)
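The BCQ entry above does not spell out the block-clustering scheme itself, so the following is only a reference point for what block (group-wise) weight quantization means in general: a minimal symmetric absmax quantizer with one scale per block, using hypothetical function names.

```python
import torch

def blockwise_quantize(weight, block_size=64, bits=4):
    """Symmetric absmax quantization with one scale per 1-D block (group)
    of `block_size` consecutive weights. Assumes numel is divisible by block_size."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for signed 4-bit
    groups = weight.reshape(-1, block_size)
    scale = groups.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = (groups / scale).round().clamp(-qmax - 1, qmax).to(torch.int8)
    return q, scale

def blockwise_dequantize(q, scale, shape):
    """Reverse the mapping: integer codes times their per-block scale."""
    return (q.float() * scale).reshape(shape)

w = torch.randn(256, 256)
q, s = blockwise_quantize(w)
w_hat = blockwise_dequantize(q, s, w.shape)
print("mean abs quantization error:", (w - w_hat).abs().mean().item())
```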
- FlexiGPT: Pruning and Extending Large Language Models with Low-Rank Weight Sharing [59.12511498024836]
We present a method for pruning large language models (LLMs) that selectively removes model blocks based on an importance score.
We propose a principled metric to replace each pruned block using a weight-sharing mechanism.
Empirical evaluations demonstrate substantial performance gains over existing methods.
arXiv Detail & Related papers (2025-01-24T18:46:37Z)
- Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss [59.835032408496545]
We propose a tile-based strategy that partitions the contrastive loss calculation into arbitrary small blocks.
We also introduce a multi-level tiling strategy to leverage the hierarchical structure of distributed systems.
Compared to SOTA memory-efficient solutions, it achieves a two-order-of-magnitude reduction in memory while maintaining comparable speed.
arXiv Detail & Related papers (2024-10-22T17:59:30Z)
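The tiling described above can be sketched as a streaming log-sum-exp over column tiles of the similarity matrix, so the full batch-by-batch matrix is never materialized at once. This is a simplified, single-direction InfoNCE forward pass assuming in-batch positives on the diagonal; the function name and tile size are illustrative, and the paper's multi-level distributed tiling and memory-efficient backward pass are not reproduced.

```python
import torch
import torch.nn.functional as F

def tiled_info_nce(q, k, temperature=0.07, tile_size=1024):
    """One-directional InfoNCE over in-batch negatives, computed tile by tile
    so the full (B, B) similarity matrix is never held in memory at once."""
    q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
    B = q.size(0)
    pos_logit = (q * k).sum(dim=-1) / temperature            # diagonal (positive) logits
    # Streaming (online-softmax style) log-sum-exp over each row.
    row_max = torch.full((B,), float("-inf"))
    row_sum = torch.zeros(B)
    for start in range(0, B, tile_size):
        logits = q @ k[start:start + tile_size].T / temperature   # (B, tile)
        tile_max = logits.max(dim=-1).values
        new_max = torch.maximum(row_max, tile_max)
        row_sum = (row_sum * torch.exp(row_max - new_max)
                   + torch.exp(logits - new_max[:, None]).sum(dim=-1))
        row_max = new_max
    log_denominator = row_max + row_sum.log()
    return (log_denominator - pos_logit).mean()               # mean cross-entropy

# 4096 paired embeddings processed in 512-column tiles.
q, k = torch.randn(4096, 256), torch.randn(4096, 256)
print(tiled_info_nce(q, k, tile_size=512))
```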
- Improved Block Merging for 3D Point Cloud Instance Segmentation [6.632158868486343]
The proposed work improves over the state-of-the-art by allowing wrongly labelled points of already processed blocks to be corrected through label propagation.
Our experiments show that the proposed block merging algorithm significantly and consistently improves the obtained accuracy for all evaluation metrics employed in literature.
arXiv Detail & Related papers (2024-07-09T16:06:34Z)
- Towards Universal Dense Blocking for Entity Resolution [49.06313308481536]
We propose UniBlocker, a dense blocker that is pre-trained on a domain-independent, easily-obtainable corpus.
By conducting domain-independent pre-training, UniBlocker can be adapted to various downstream blocking scenarios without requiring domain-specific fine-tuning.
Our experiments show that the proposed UniBlocker, without any domain-specific learning, significantly outperforms previous self- and unsupervised dense blocking methods.
arXiv Detail & Related papers (2024-04-23T08:39:29Z)
- Accurate Block Quantization in LLMs with Outliers [0.6138671548064355]
The demand for inference on extremely large-scale LLMs has grown enormously in recent months.
The problem is aggravated by the rapid growth in the lengths of the sequences being processed.
Various quantization techniques have been proposed that allow accurate quantization for both weights and activations.
arXiv Detail & Related papers (2024-03-29T12:15:06Z)
- Constant Memory Attention Block [74.38724530521277]
Constant Memory Attention Block (CMAB) is a novel general-purpose attention block that computes its output in constant memory and performs updates in constant computation.
We show our proposed methods achieve results competitive with state-of-the-art while being significantly more memory efficient.
arXiv Detail & Related papers (2023-06-21T22:41:58Z)
- SC-Block: Supervised Contrastive Blocking within Entity Resolution Pipelines [75.5113002732746]
This paper presents SC-Block, a blocking method that utilizes supervised contrastive learning for positioning records in the embedding space.
We benchmark SC-Block against eight state-of-the-art blocking methods.
For measuring the overall runtime, we determine candidate sets with 99.5% pair completeness and pass them to the matcher.
arXiv Detail & Related papers (2023-03-06T13:49:41Z)
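At inference time, the SC-Block entry above comes down to nearest-neighbour blocking in an embedding space plus a pair-completeness check on the resulting candidate set. A minimal sketch with hypothetical names, where random embeddings stand in for the contrastively trained encoder:

```python
import torch
import torch.nn.functional as F

def knn_candidate_pairs(emb_a, emb_b, k=10):
    """Blocking step: for every record of table A, keep only its k nearest
    neighbours in table B (cosine similarity) as candidate pairs."""
    a, b = F.normalize(emb_a, dim=-1), F.normalize(emb_b, dim=-1)
    topk = (a @ b.T).topk(k, dim=-1).indices             # (num_a, k)
    return {(i, int(j)) for i in range(a.size(0)) for j in topk[i]}

def pair_completeness(candidates, true_matches):
    """Share of ground-truth matching pairs that survive blocking."""
    return len(candidates & true_matches) / len(true_matches)

# Random embeddings stand in for the output of a contrastively trained encoder.
emb_a, emb_b = torch.randn(1000, 128), torch.randn(800, 128)
true_matches = {(i, i) for i in range(500)}               # toy ground truth
candidates = knn_candidate_pairs(emb_a, emb_b, k=10)
print(len(candidates), "candidate pairs, pair completeness:",
      round(pair_completeness(candidates, true_matches), 3))
```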
- Algorithm to Compilation Co-design: An Integrated View of Neural Network Sparsity [0.8566457170664925]
We apply structured and unstructured pruning to attention weights of transformer blocks of the BERT language model.
We study relationships between modeling decisions and their direct impact on sparsity-enhanced execution.
arXiv Detail & Related papers (2021-06-16T15:13:26Z)
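As a reference point for the unstructured case mentioned above, magnitude pruning of a block's attention projection matrices can be sketched as follows; the function name and the toy 768x768 matrices (standing in for BERT's query/key/value/output projections) are illustrative, and structured pruning is not shown.

```python
import torch

def magnitude_prune_(weight, sparsity=0.9):
    """Unstructured magnitude pruning: zero out the smallest-magnitude entries
    of a weight matrix, in place, until `sparsity` of them are zero."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight
    threshold = weight.abs().flatten().kthvalue(k).values
    weight[weight.abs() <= threshold] = 0.0
    return weight

# Toy stand-ins for one transformer block's attention projections,
# each pruned to roughly 90% sparsity.
attn_weights = {name: torch.randn(768, 768) for name in ("q", "k", "v", "out")}
for name, w in attn_weights.items():
    magnitude_prune_(w, sparsity=0.9)
    print(name, "sparsity:", (w == 0).float().mean().item())
```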