Optimizing Retrieval-augmented Reader Models via Token Elimination
- URL: http://arxiv.org/abs/2310.13682v2
- Date: Sun, 5 Nov 2023 06:31:55 GMT
- Title: Optimizing Retrieval-augmented Reader Models via Token Elimination
- Authors: Moshe Berchansky, Peter Izsak, Avi Caciularu, Ido Dagan, Moshe Wasserblat
- Abstract summary: We analyze the contribution and necessity of all the retrieved passages to the performance of reader models.
We demonstrate that our method can reduce run-time by up to 62.2%, with only a 2% reduction in performance, and in some cases, even improve the performance results.
- Score: 30.53636918279251
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fusion-in-Decoder (FiD) is an effective retrieval-augmented language model
applied across a variety of open-domain tasks, such as question answering, fact
checking, etc. In FiD, supporting passages are first retrieved and then
processed using a generative model (Reader), which can cause a significant
bottleneck in decoding time, particularly with long outputs. In this work, we
analyze the contribution and necessity of all the retrieved passages to the
performance of reader models, and propose eliminating some of the retrieved
information, at the token level, that might not contribute essential
information to the answer generation process. We demonstrate that our method
can reduce run-time by up to 62.2%, with only a 2% reduction in performance,
and in some cases, even improve the performance results.
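To make the idea concrete, here is a minimal sketch of token-level elimination, assuming per-token contribution scores (e.g., decoder cross-attention mass) are already available; the helper name and the fixed keep-ratio are illustrative stand-ins, not the paper's exact procedure.

```python
# Minimal sketch: prune low-contribution retrieved tokens before further decoding.
# `eliminate_tokens` and the fixed keep_ratio are hypothetical, for illustration only.
import torch

def eliminate_tokens(encoder_states, scores, keep_ratio=0.4):
    """Keep only the highest-scoring fraction of retrieved tokens.

    encoder_states: (num_tokens, hidden) concatenated passage encodings
    scores:         (num_tokens,) per-token contribution, e.g. cross-attention mass
    """
    k = max(1, int(keep_ratio * scores.numel()))
    keep = torch.topk(scores, k).indices.sort().values  # preserve original token order
    return encoder_states[keep], keep

# Toy usage: 3 passages x 50 tokens of 768-dim states, random scores.
states = torch.randn(150, 768)
scores = torch.rand(150)
pruned, kept = eliminate_tokens(states, scores)
print(pruned.shape)  # torch.Size([60, 768]) -- the decoder now attends to fewer tokens
```

Pruning shrinks the decoder's cross-attention workload at every generated token, which is where the run-time savings come from.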
Related papers
- Static Pruning in Dense Retrieval using Matrix Decomposition [12.899105656025018]
In the era of dense retrieval, document indexing and retrieval is largely based on encoding models that transform text documents into embeddings.
Recent studies have shown that it is possible to reduce embedding size without sacrificing - and in some cases improving - the retrieval effectiveness.
We present a novel static pruning method for reducing the dimensionality of embeddings using Principal Components Analysis.
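As a rough illustration, the sketch below applies PCA to prune embedding dimensions before scoring; the random corpus, the component count, and the use of scikit-learn are assumptions for the example, not the paper's exact pipeline.

```python
# Hedged sketch: static pruning of dense-retrieval embeddings via PCA.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
doc_embs = rng.normal(size=(10_000, 768)).astype(np.float32)    # stand-in corpus embeddings
query_embs = rng.normal(size=(32, 768)).astype(np.float32)

pca = PCA(n_components=256)                  # prune 768 -> 256 dimensions
docs_small = pca.fit_transform(doc_embs)     # fit the projection on documents
queries_small = pca.transform(query_embs)    # apply the same projection to queries

# Retrieval proceeds as usual, now over much smaller vectors.
scores = queries_small @ docs_small.T
top1 = scores.argmax(axis=1)
print(docs_small.shape, top1.shape)          # (10000, 256) (32,)
```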
arXiv Detail & Related papers (2024-12-13T09:09:20Z)
- Likelihood as a Performance Gauge for Retrieval-Augmented Generation [78.28197013467157]
We show that likelihoods serve as an effective gauge for language model performance.
We propose two methods that use question likelihood as a gauge for selecting and constructing prompts that lead to better performance.
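A minimal sketch of the likelihood-as-gauge idea, assuming a small causal LM and a simple concatenation prompt (both illustrative choices): score each candidate context by the mean log-probability it assigns to the question, and keep the best.

```python
# Hedged sketch: rank candidate contexts by the likelihood of the question.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def question_loglik(context: str, question: str) -> float:
    """Mean log-probability of the question tokens, conditioned on the context."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    q_ids = tok(" " + question, return_tensors="pt").input_ids
    ids = torch.cat([ctx_ids, q_ids], dim=1)
    with torch.no_grad():
        logits = lm(ids).logits
    logp = logits[:, :-1].log_softmax(-1)           # predict each next token
    targets = ids[:, 1:]
    token_lp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp[:, -q_ids.shape[1]:].mean().item()  # question tokens only

contexts = ["Paris is the capital of France.", "Bananas are rich in potassium."]
question = "What is the capital of France?"
print(max(contexts, key=lambda c: question_loglik(c, question)))
```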
arXiv Detail & Related papers (2024-11-12T13:14:09Z)
- Think-then-Act: A Dual-Angle Evaluated Retrieval-Augmented Generation [3.2134014920850364]
Large language models (LLMs) often face challenges such as temporal misalignment and generating hallucinatory content.
We propose a dual-angle evaluated retrieval-augmented generation framework, 'Think-then-Act'.
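A hedged control-flow sketch of such a dual-check pipeline follows; the judge prompts, the yes/no gating, and the stub model are invented for illustration and are not the framework's actual components.

```python
# Illustrative control flow: first decide whether retrieval is needed,
# then decide whether the model can answer at all. All prompts are hypothetical.
def think_then_act(question, llm, retriever):
    needs_retrieval = llm("Does answering this require external documents? "
                          f"Answer yes or no.\nQ: {question}").strip().lower() == "yes"
    context = retriever(question) if needs_retrieval else ""
    can_answer = llm(f"Can you answer this reliably? yes/no\nQ: {question}\n"
                     f"Context: {context}").strip().lower() == "yes"
    if not can_answer:
        return "I don't know."
    return llm(f"Context: {context}\nQ: {question}\nA:")

# Toy stubs so the sketch runs end to end.
llm = lambda p: "yes" if "require external" in p or "reliably" in p else "Paris"
retriever = lambda q: "Paris is the capital of France."
print(think_then_act("What is the capital of France?", llm, retriever))
```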
arXiv Detail & Related papers (2024-06-18T20:51:34Z)
- CItruS: Chunked Instruction-aware State Eviction for Long Sequence Modeling [52.404072802235234]
We introduce Chunked Instruction-aware State Eviction (CItruS), a novel modeling technique that integrates the attention preferences useful for a downstream task into the eviction process of hidden states.
Our training-free method exhibits superior performance on long sequence comprehension and retrieval tasks over several strong baselines under the same memory budget.
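As a simplified sketch of instruction-aware eviction (the single-head shapes and the summed-attention scoring rule are assumptions), cached states can be ranked by how much attention the instruction tokens pay them, with only a fixed budget retained:

```python
# Hedged sketch: evict cached key/value states least attended by the instruction.
import torch

def evict_states(keys, values, instr_queries, budget):
    """keys/values: (seq, dim) cached states; instr_queries: (n_instr, dim)."""
    attn = (instr_queries @ keys.T).softmax(dim=-1)   # (n_instr, seq)
    score = attn.sum(dim=0)                           # instruction attention per state
    keep = torch.topk(score, budget).indices.sort().values
    return keys[keep], values[keep]

keys, values = torch.randn(4096, 64), torch.randn(4096, 64)
instr = torch.randn(16, 64)                           # queries from the instruction chunk
k2, v2 = evict_states(keys, values, instr, budget=512)
print(k2.shape)  # torch.Size([512, 64]) -- cache now fits the memory budget
```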
arXiv Detail & Related papers (2024-06-17T18:34:58Z)
- Accelerating Inference of Retrieval-Augmented Generation via Sparse Context Selection [28.15184715270483]
Large language models (LLMs) augmented with retrieval exhibit robust performance and extensive versatility.
We propose a novel paradigm named Sparse RAG, which seeks to cut costs through sparsity.
Sparse RAG encodes retrieved documents in parallel, which eliminates the latency introduced by long-range attention over retrieved documents.
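A rough sketch of the sparse-context idea follows; scoring documents by embedding similarity is a stand-in assumption here, whereas Sparse RAG itself lets the LLM assess relevance via special control tokens.

```python
# Hedged sketch: encode each document independently (parallelizable, cacheable),
# then let decoding attend only to a small selected subset.
import numpy as np

rng = np.random.default_rng(0)
doc_encodings = [rng.normal(size=(100, 512)) for _ in range(20)]  # per-doc states
query_vec = rng.normal(size=512)

def select_sparse(doc_encodings, query_vec, k=3):
    # Stand-in relevance score: mean pooled encoding vs. query vector.
    scores = [enc.mean(axis=0) @ query_vec for enc in doc_encodings]
    top = np.argsort(scores)[-k:]
    return [doc_encodings[i] for i in top]   # decoder cross-attends to these only

selected = select_sparse(doc_encodings, query_vec)
print(len(selected), selected[0].shape)      # 3 (100, 512)
```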
arXiv Detail & Related papers (2024-05-25T11:10:04Z)
- List-aware Reranking-Truncation Joint Model for Search and Retrieval-augmented Generation [80.12531449946655]
We propose a Reranking-Truncation joint model (GenRT) that can perform the two tasks concurrently.
GenRT integrates reranking and truncation via a generative paradigm based on an encoder-decoder architecture.
Our method achieves SOTA performance on both reranking and truncation tasks for web search and retrieval-augmented LLMs.
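The joint decision can be sketched as a step-by-step selection loop that both orders the candidates and decides where to cut the list; the threshold rule below is a stand-in for GenRT's learned encoder-decoder, not its actual scoring.

```python
# Hedged sketch: reranking and truncation produced together in one pass.
import numpy as np

def rerank_and_truncate(scores, stop_threshold=0.5):
    """scores: relevance estimates per candidate; returns a ranked, truncated list."""
    ranked = []
    for idx in np.argsort(scores)[::-1]:      # emit next-best candidate each step
        if scores[idx] < stop_threshold:      # generative 'stop' decision = truncation
            break
        ranked.append(int(idx))
    return ranked

print(rerank_and_truncate(np.array([0.9, 0.2, 0.7, 0.55, 0.4])))  # [0, 2, 3]
```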
arXiv Detail & Related papers (2024-02-05T06:52:53Z)
- Making Retrieval-Augmented Language Models Robust to Irrelevant Context [55.564789967211844]
An important desideratum of RALMs is that retrieved information helps model performance when it is relevant, and does not harm it when it is not.
Recent work has shown that retrieval augmentation can sometimes have a negative effect on performance.
arXiv Detail & Related papers (2023-10-02T18:52:35Z)
- Tokenization Consistency Matters for Generative Models on Extractive NLP Tasks [54.306234256074255]
We identify the issue of tokenization inconsistency that is commonly neglected in training generative models.
When the input and output are tokenized inconsistently, the extractive nature of these tasks is undermined: the answer's tokens may no longer match any span of the input's tokens.
We show that, with consistent tokenization, the model performs better in both in-domain and out-of-domain datasets.
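The inconsistency is easy to observe with a BPE tokenizer: an answer span tokenized on its own generally differs from the same span tokenized as it occurs inside the context (the GPT-2 tokenizer here is an illustrative choice, not the paper's setup).

```python
# Hedged sketch: the same answer string yields different token ids standalone
# vs. in context, because BPE merges depend on the preceding space.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
answer = "Paris"

standalone = tok(answer, add_special_tokens=False).input_ids        # 'Paris'
in_context = tok(" " + answer, add_special_tokens=False).input_ids  # ' Paris', as after a space

print(standalone, in_context, standalone == in_context)
# The ids differ, so a target tokenized standalone is not a sub-sequence of the
# tokenized input; consistent tokenization fixes this by tokenizing the span in situ.
```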
arXiv Detail & Related papers (2022-12-19T23:33:21Z)
- Autoregressive Search Engines: Generating Substrings as Document Identifiers [53.0729058170278]
Autoregressive language models are emerging as the de facto standard for generating answers.
Previous work has explored ways to partition the search space into hierarchical structures.
In this work we propose an alternative that doesn't force any structure on the search space: using all n-grams in a passage as its possible identifiers.
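A toy sketch of the n-gram-identifier idea, using a plain dictionary index where the actual system (SEAL) uses an FM-index with constrained decoding:

```python
# Hedged sketch: any n-gram occurring in a passage can identify it, so a
# generated string is looked up in an index of all passage n-grams.
from collections import defaultdict

def build_ngram_index(passages, n=3):
    index = defaultdict(set)
    for pid, text in enumerate(passages):
        words = text.lower().split()
        for i in range(len(words) - n + 1):
            index[tuple(words[i:i + n])].add(pid)
    return index

passages = ["the eiffel tower is in paris",
            "the great wall of china is long"]
index = build_ngram_index(passages)
generated = "tower is in"                          # string produced by the LM
print(index.get(tuple(generated.split()), set()))  # {0}
```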
arXiv Detail & Related papers (2022-04-22T10:45:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.