On Single and Multiple Representations in Dense Passage Retrieval
- URL: http://arxiv.org/abs/2108.06279v1
- Date: Fri, 13 Aug 2021 15:01:53 GMT
- Title: On Single and Multiple Representations in Dense Passage Retrieval
- Authors: Craig Macdonald, Nicola Tonellotto, Iadh Ounis
- Abstract summary: Two dense retrieval families have become apparent: single representation and multiple representation.
This paper contributes a direct study on their comparative effectiveness, noting situations where each method under- or over-performs w.r.t. the other, and w.r.t. a BM25 baseline.
We also show that multiple representations obtain better improvements than single representations for queries that are the hardest for BM25, as well as for definitional queries.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The advent of contextualised language models has brought gains in search
effectiveness, not just when applied for re-ranking the output of classical
weighting models such as BM25, but also when used directly for passage indexing
and retrieval, a technique which is called dense retrieval. In the existing
literature in neural ranking, two dense retrieval families have become
apparent: single representation, where entire passages are represented by a
single embedding (usually BERT's [CLS] token, as exemplified by the recent ANCE
approach), or multiple representations, where each token in a passage is
represented by its own embedding (as exemplified by the recent ColBERT
approach). These two families have not been directly compared. However, because
of the likely importance of dense retrieval moving forward, a clear
understanding of their advantages and disadvantages is paramount. To this end,
this paper contributes a direct study on their comparative effectiveness,
noting situations where each method under- or over-performs w.r.t. the other, and
w.r.t. a BM25 baseline. We observe that, while ANCE is more efficient than
ColBERT in terms of response time and memory usage, multiple representations
are statistically more effective than the single representations for MAP and
MRR@10. We also show that multiple representations obtain better improvements
than single representations for queries that are the hardest for BM25, as well
as for definitional queries, and those with complex information needs.
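To make the contrast concrete, here is a minimal sketch of how the two families score a query-passage pair (NumPy, with random vectors standing in for BERT outputs; an illustration of the scoring shapes only, not the ANCE or ColBERT implementation):

```python
import numpy as np

def single_rep_score(q_vec: np.ndarray, p_vec: np.ndarray) -> float:
    """Single representation (ANCE-style): the whole query and the whole
    passage are each one pooled embedding (e.g. the [CLS] token), scored
    by a single dot product."""
    return float(np.dot(q_vec, p_vec))

def multi_rep_score(q_toks: np.ndarray, p_toks: np.ndarray) -> float:
    """Multiple representations (ColBERT-style late interaction): every
    query token embedding is matched against its most similar passage
    token embedding (MaxSim), and the per-token maxima are summed."""
    sim = q_toks @ p_toks.T            # (n_query_tokens, n_passage_tokens)
    return float(sim.max(axis=1).sum())

rng = np.random.default_rng(0)
print(single_rep_score(rng.standard_normal(128), rng.standard_normal(128)))
print(multi_rep_score(rng.standard_normal((5, 128)), rng.standard_normal((40, 128))))
```

The efficiency gap reported in the abstract follows directly from these shapes: a single-representation index stores one vector per passage, whereas a multiple-representation index stores one vector per token, hence ColBERT's larger memory footprint and slower response time.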
Related papers
- Learning More Effective Representations for Dense Retrieval through Deliberate Thinking Before Search
The Deliberate Thinking based Dense Retriever (DEBATER) enhances recent dense retrievers by enabling them to learn more effective document representations through a step-by-step thinking process.
Experimental results show that DEBATER significantly outperforms existing methods across several retrieval benchmarks.
arXiv Detail & Related papers (2025-02-18T15:56:34Z)
- Lexically-Accelerated Dense Retrieval
LADR (Lexically-Accelerated Dense Retrieval) is a simple yet effective approach that improves the efficiency of existing dense retrieval models.
LADR consistently achieves both precision and recall that are on par with an exhaustive search on standard benchmarks.
arXiv Detail & Related papers (2023-07-31T15:44:26Z)
- BERM: Training the Balanced and Extractable Representation for Matching to Improve Generalization Ability of Dense Retrieval
Dense retrieval has shown promise in the first-stage retrieval process when trained on in-domain labeled datasets.
We propose BERM, a novel method that improves the generalization of dense retrieval by capturing the matching signal.
arXiv Detail & Related papers (2023-05-18T15:43:09Z)
- Query2doc: Query Expansion with Large Language Models
The proposed method first generates pseudo-documents by few-shot prompting large language models (LLMs), and then expands the query with them.
query2doc boosts the performance of BM25 by 3% to 15% on ad-hoc IR datasets.
Our method also benefits state-of-the-art dense retrievers in terms of both in-domain and out-of-domain results.
arXiv Detail & Related papers (2023-03-14T07:27:30Z)
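A rough illustration of the query2doc recipe just summarised, as a sketch under assumptions: `generate_pseudo_document` is a hypothetical stand-in for the few-shot LLM call, and the query repetition (to keep the original terms from being drowned out by the pseudo-document) uses a repetition count that should be treated as a tunable assumption:

```python
def generate_pseudo_document(query: str) -> str:
    """Hypothetical placeholder for few-shot prompting an LLM to write
    a short passage that answers the query."""
    raise NotImplementedError("call your LLM of choice here")

def expand_query_for_bm25(query: str, pseudo_doc: str, reps: int = 5) -> str:
    """query2doc-style expansion for sparse retrieval: repeat the original
    query several times, then append the generated pseudo-document."""
    return " ".join([query] * reps + [pseudo_doc])

# The expanded string is then issued to an ordinary BM25 engine:
# expanded = expand_query_for_bm25(q, generate_pseudo_document(q))
```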
- CITADEL: Conditional Token Interaction via Dynamic Lexical Routing for Efficient and Effective Multi-Vector Retrieval
Multi-vector retrieval methods combine the merits of sparse (e.g. BM25) and dense (e.g. DPR) retrievers.
However, these methods are orders of magnitude slower and need much more space to store their indices compared to their single-vector counterparts.
We propose conditional token interaction via dynamic lexical routing, namely CITADEL, for efficient and effective multi-vector retrieval.
arXiv Detail & Related papers (2022-11-18T18:27:35Z)
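A loose sketch of what conditional token interaction via lexical routing could look like, inferred from the abstract alone: each token embedding is routed to a small set of lexical keys, and a query token only interacts with passage tokens that share one of its keys. The learned router and load balancing of the actual CITADEL model are omitted; treat every detail below as an assumption.

```python
import numpy as np

def route(token_vecs: np.ndarray, router: np.ndarray, topk: int = 1) -> list[set]:
    """Assign each token embedding to its top-k lexical keys via a linear
    router (here a random projection standing in for a learned one)."""
    logits = token_vecs @ router                   # (n_tokens, n_keys)
    top = np.argsort(-logits, axis=1)[:, :topk]
    return [set(row) for row in top]

def routed_score(q_vecs, q_keys, p_vecs, p_keys) -> float:
    """MaxSim restricted to token pairs that share at least one key."""
    score = 0.0
    for qv, qk in zip(q_vecs, q_keys):
        sims = [float(qv @ pv) for pv, pk in zip(p_vecs, p_keys) if qk & pk]
        score += max(sims, default=0.0)
    return score

rng = np.random.default_rng(0)
router = rng.standard_normal((128, 30522))         # dim x vocabulary-sized key space
q, p = rng.standard_normal((4, 128)), rng.standard_normal((30, 128))
print(routed_score(q, route(q, router), p, route(p, router, topk=3)))
```

Indexing passage tokens under their routed keys is presumably what buys the speed: at query time only the postings of a few keys are touched, rather than every token vector in the collection.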
- UnifieR: A Unified Retriever for Large-Scale Retrieval
Large-scale retrieval aims to recall relevant documents from a huge collection given a query.
Recent retrieval methods based on pre-trained language models (PLM) can be coarsely categorized into either dense-vector or lexicon-based paradigms.
We propose a new learning framework, UnifieR, which unifies dense-vector and lexicon-based retrieval in one model with a dual-representing capability.
arXiv Detail & Related papers (2022-05-23T11:01:59Z)
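UnifieR's dual representation lives inside a single learned model, but the underlying idea of combining the two paradigms can be pictured with plain score fusion. To be clear, the following is generic hybrid scoring, not the UnifieR architecture; `alpha` is a hypothetical mixing weight:

```python
import numpy as np

def hybrid_score(dense_q: np.ndarray, dense_d: np.ndarray,
                 lex_q: dict[str, float], lex_d: dict[str, float],
                 alpha: float = 0.5) -> float:
    """Interpolate a dense-vector score with a lexicon-based (sparse
    term-weight) score for the same query-document pair."""
    dense = float(dense_q @ dense_d)
    lexical = sum(w * lex_d.get(t, 0.0) for t, w in lex_q.items())
    return alpha * dense + (1.0 - alpha) * lexical

print(hybrid_score(np.ones(4), np.ones(4),
                   {"paris": 2.0}, {"paris": 1.5, "france": 1.0}))
# 0.5 * 4.0 + 0.5 * 3.0 = 3.5
```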
- Autoregressive Search Engines: Generating Substrings as Document Identifiers
Autoregressive language models are emerging as the de facto standard for generating answers.
Previous work has explored ways to partition the search space into hierarchical structures.
In this work, we propose an alternative that does not force any structure on the search space: using all n-grams in a passage as its possible identifiers.
arXiv Detail & Related papers (2022-04-22T10:45:01Z)
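The "all n-grams as identifiers" idea can be pictured with a toy index. This is a minimal sketch: in the actual paper, generation is constrained and scored with an FM-index over the corpus, which is omitted here.

```python
from collections import defaultdict

def index_ngrams(passages: dict[str, str], n: int = 3) -> dict[tuple, set]:
    """Map every word n-gram of every passage to the ids of the passages
    containing it, so that any generated n-gram identifies documents."""
    index = defaultdict(set)
    for pid, text in passages.items():
        toks = text.lower().split()
        for i in range(len(toks) - n + 1):
            index[tuple(toks[i:i + n])].add(pid)
    return index

passages = {"p1": "the eiffel tower is in paris",
            "p2": "paris is the capital of france"}
idx = index_ngrams(passages)
# An autoregressive LM would generate an n-gram such as "is in paris";
# the index maps it straight back to candidate passages:
print(idx[("is", "in", "paris")])   # {'p1'}
```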
- CODER: An efficient framework for improving retrieval through COntextualized Document Embedding Reranking
We present a framework for improving the performance of a wide class of retrieval models at minimal computational cost.
It utilizes precomputed document representations extracted by a base dense retrieval method.
It incurs a negligible computational overhead on top of any first-stage method at run time, allowing it to be easily combined with any state-of-the-art dense retrieval method.
arXiv Detail & Related papers (2021-12-16T10:25:26Z)
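The negligible-overhead claim comes from only re-scoring against stored vectors: document embeddings are precomputed once by the base retriever and never re-encoded at query time. A minimal sketch of that reranking step, assuming a candidate list and an embedding store from the first stage (CODER's list-wise training objective, its actual contribution, is not shown):

```python
import numpy as np

def rerank_with_precomputed(q_vec: np.ndarray,
                            candidates: list[str],
                            doc_vecs: dict[str, np.ndarray]) -> list[str]:
    """Re-score first-stage candidates against precomputed document
    embeddings, one dot product per candidate; no document encoding
    happens at query time."""
    return sorted(candidates, key=lambda pid: -float(q_vec @ doc_vecs[pid]))
```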
- A Replication Study of Dense Passage Retriever
We study the dense passage retriever (DPR) technique proposed by Karpukhin et al. (2020) for end-to-end open-domain question answering.
We present a replication study of this work, starting with model checkpoints provided by the authors.
We are able to improve end-to-end question answering effectiveness using exactly the same models as in the original work.
arXiv Detail & Related papers (2021-04-12T18:10:39Z)