Related papers: Dense X Retrieval: What Retrieval Granularity Should We Use?

Dense X Retrieval: What Retrieval Granularity Should We Use?

URL: http://arxiv.org/abs/2312.06648v2
Date: Tue, 12 Dec 2023 03:37:59 GMT
Title: Dense X Retrieval: What Retrieval Granularity Should We Use?
Authors: Tong Chen, Hongwei Wang, Sihao Chen, Wenhao Yu, Kaixin Ma, Xinran Zhao, Hongming Zhang, Dong Yu
Abstract summary: Often-overlooked design choice is the retrieval unit in which the corpus is indexed, e.g. document, passage, or sentence. We introduce a novel retrieval unit, proposition, for dense retrieval. Propositions are defined as atomic expressions within text, each encapsulating a distinct factoid. Our results reveal that proposition-based retrieval significantly outperforms traditional passage or sentence-based methods in dense retrieval.
Score: 59.359325855708974
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Dense retrieval has become a prominent method to obtain relevant context or world knowledge in open-domain NLP tasks. When we use a learned dense retriever on a retrieval corpus at inference time, an often-overlooked design choice is the retrieval unit in which the corpus is indexed, e.g. document, passage, or sentence. We discover that the retrieval unit choice significantly impacts the performance of both retrieval and downstream tasks. Distinct from the typical approach of using passages or sentences, we introduce a novel retrieval unit, proposition, for dense retrieval. Propositions are defined as atomic expressions within text, each encapsulating a distinct factoid and presented in a concise, self-contained natural language format. We conduct an empirical comparison of different retrieval granularity. Our results reveal that proposition-based retrieval significantly outperforms traditional passage or sentence-based methods in dense retrieval. Moreover, retrieval by proposition also enhances the performance of downstream QA tasks, since the retrieved texts are more condensed with question-relevant information, reducing the need for lengthy input tokens and minimizing the inclusion of extraneous, irrelevant information.

Related papers

ELITE: Embedding-Less retrieval with Iterative Text Exploration [5.8851517822935335]
Large Language Models (LLMs) have achieved impressive progress in natural language processing.<n>Their limited ability to retain long-term context constrains performance on document-level or multi-turn tasks.
arXiv Detail & Related papers (2025-05-17T08:48:43Z)
Likelihood as a Performance Gauge for Retrieval-Augmented Generation [78.28197013467157]
We show that likelihoods serve as an effective gauge for language model performance. We propose two methods that use question likelihood as a gauge for selecting and constructing prompts that lead to better performance.
arXiv Detail & Related papers (2024-11-12T13:14:09Z)
Improve Dense Passage Retrieval with Entailment Tuning [22.39221206192245]
Key to a retrieval system is to calculate relevance scores to query and passage pairs. We observed that a major class of relevance aligns with the concept of entailment in NLI tasks. We design a method called entailment tuning to improve the embedding of dense retrievers.
arXiv Detail & Related papers (2024-10-21T09:18:30Z)
Revisiting Sparse Retrieval for Few-shot Entity Linking [33.15662306409253]
We propose an ELECTRA-based keyword extractor to denoise the mention context and construct a better query expression. For training the extractor, we propose a distant supervision method to automatically generate training data based on overlapping tokens between mention contexts and entity descriptions. Experimental results on the ZESHEL dataset demonstrate that the proposed method outperforms state-of-the-art models by a significant margin across all test domains.
arXiv Detail & Related papers (2023-10-19T03:51:10Z)
Augmented Embeddings for Custom Retrievals [13.773007276544913]
We introduce Adapted Dense Retrieval, a mechanism to transform embeddings to enable improved task-specific, heterogeneous and strict retrieval. Dense Retrieval works by learning a low-rank residual adaptation of the pretrained black-box embedding.
arXiv Detail & Related papers (2023-10-09T03:29:35Z)
Bridging the Training-Inference Gap for Dense Phrase Retrieval [104.4836127502683]
Building dense retrievers requires a series of standard procedures, including training and validating neural models. In this paper, we explore how the gap between training and inference in dense retrieval can be reduced. We propose an efficient way of validating dense retrievers using a small subset of the entire corpus.
arXiv Detail & Related papers (2022-10-25T00:53:06Z)
UnifieR: A Unified Retriever for Large-Scale Retrieval [84.61239936314597]
Large-scale retrieval is to recall relevant documents from a huge collection given a query. Recent retrieval methods based on pre-trained language models (PLM) can be coarsely categorized into either dense-vector or lexicon-based paradigms. We propose a new learning framework, UnifieR which unifies dense-vector and lexicon-based retrieval in one model with a dual-representing capability.
arXiv Detail & Related papers (2022-05-23T11:01:59Z)
Phrase Retrieval Learns Passage Retrieval, Too [77.57208968326422]
We study whether phrase retrieval can serve as the basis for coarse-level retrieval including passages and documents. We show that a dense phrase-retrieval system, without any retraining, already achieves better passage retrieval accuracy. We also show that phrase filtering and vector quantization can reduce the size of our index by 4-10x.
arXiv Detail & Related papers (2021-09-16T17:42:45Z)
Generation-Augmented Retrieval for Open-domain Question Answering [134.27768711201202]
Generation-Augmented Retrieval (GAR) for answering open-domain questions. We show that generating diverse contexts for a query is beneficial as fusing their results consistently yields better retrieval accuracy. GAR achieves state-of-the-art performance on Natural Questions and TriviaQA datasets under the extractive QA setup when equipped with an extractive reader.
arXiv Detail & Related papers (2020-09-17T23:08:01Z)
Progressively Pretrained Dense Corpus Index for Open-Domain Question Answering [87.32442219333046]
We propose a simple and resource-efficient method to pretrain the paragraph encoder. Our method outperforms an existing dense retrieval method that uses 7 times more computational resources for pretraining.
arXiv Detail & Related papers (2020-04-30T18:09:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.