Query-as-context Pre-training for Dense Passage Retrieval
- URL: http://arxiv.org/abs/2212.09598v3
- Date: Sun, 15 Oct 2023 03:43:53 GMT
- Title: Query-as-context Pre-training for Dense Passage Retrieval
- Authors: Xing Wu, Guangyuan Ma, Wanhui Qian, Zijia Lin, Songlin Hu
- Abstract summary: Context-supervised pre-training methods for dense passage retrieval treat two passages from the same document as relevant, ignoring the possibility of weakly correlated pairs.
This paper proposes query-as-context pre-training, a simple yet effective pre-training technique to alleviate this issue.
- Score: 27.733665432319803
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, methods have been developed to improve the performance of dense
passage retrieval by using context-supervised pre-training. These methods
simply consider two passages from the same document to be relevant, without
taking into account the possibility of weakly correlated pairs. Thus, this
paper proposes query-as-context pre-training, a simple yet effective
pre-training technique to alleviate the issue. Query-as-context pre-training
assumes that the query derived from a passage is more likely to be relevant to
that passage and forms a passage-query pair. These passage-query pairs are then
used in contrastive or generative context-supervised pre-training. The
pre-trained models are evaluated on large-scale passage retrieval benchmarks
and out-of-domain zero-shot benchmarks. Experimental results show that
query-as-context pre-training brings considerable gains and meanwhile speeds up
training, demonstrating its effectiveness and efficiency. Our code will be
available at https://github.com/caskcsg/ir/tree/main/cotmae-qc .
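The contrastive variant described in the abstract can be illustrated with a minimal sketch. It assumes each passage already has one derived query and that both have been encoded into vectors; the toy vectors, the dot-product similarity, the in-batch negatives, and the function name `info_nce_loss` are all assumptions made for illustration, not the authors' implementation (see the linked repository for that).

```python
import math


def dot(u, v):
    """Dot-product similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce_loss(passage_vecs, query_vecs, temperature=1.0):
    """Mean in-batch contrastive loss over aligned (passage, query) pairs.

    Row i of passage_vecs is paired with row i of query_vecs; every other
    query in the batch is treated as a negative for passage i.
    """
    assert len(passage_vecs) == len(query_vecs)
    losses = []
    for i, p in enumerate(passage_vecs):
        logits = [dot(p, q) / temperature for q in query_vecs]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of the passage's own query.
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


# Hypothetical toy embeddings standing in for encoder outputs.
passages = [[1.0, 0.0], [0.0, 1.0]]
aligned_queries = [[0.9, 0.1], [0.1, 0.9]]   # each query matches its passage
swapped_queries = [[0.1, 0.9], [0.9, 0.1]]   # pairs deliberately mismatched

aligned_loss = info_nce_loss(passages, aligned_queries)
swapped_loss = info_nce_loss(passages, swapped_queries)
print(aligned_loss < swapped_loss)
```

The loss is lower when each passage is closest to its own query, which is the signal a passage-query pair is meant to provide in place of a possibly weakly correlated passage-passage pair.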
Related papers
- Improve Dense Passage Retrieval with Entailment Tuning [22.39221206192245]
A key part of a retrieval system is calculating relevance scores for query-passage pairs.
We observed that a major class of relevance aligns with the concept of entailment in NLI tasks.
We design a method called entailment tuning to improve the embedding of dense retrievers.
arXiv Detail & Related papers (2024-10-21T09:18:30Z)
- Few-shot Prompting for Pairwise Ranking: An Effective Non-Parametric Retrieval Model [18.111868378615206]
We propose a pairwise few-shot ranker that achieves performance close to that of a supervised model without requiring any complex training pipeline.
arXiv Detail & Related papers (2024-09-26T11:19:09Z)
- Bootstrapped Pre-training with Dynamic Identifier Prediction for Generative Retrieval [108.9772640854136]
Generative retrieval uses differentiable search indexes to directly generate relevant document identifiers in response to a query.
Recent studies have highlighted the potential of a strong generative retrieval model, trained with carefully crafted pre-training tasks, to enhance downstream retrieval tasks via fine-tuning.
We introduce BootRet, a bootstrapped pre-training method for generative retrieval that dynamically adjusts document identifiers during pre-training to accommodate the continued memorization of the corpus.
arXiv Detail & Related papers (2024-07-16T08:42:36Z)
- Revisiting and Maximizing Temporal Knowledge in Semi-supervised Semantic Segmentation [7.005068872406135]
Mean Teacher- and co-training-based approaches are employed to mitigate confirmation bias and coupling problems.
These approaches frequently involve complex training pipelines and a substantial computational burden.
We propose a PrevMatch framework that effectively mitigates the limitations by maximizing the utilization of the temporal knowledge obtained during the training process.
arXiv Detail & Related papers (2024-05-31T03:54:59Z)
- In-context Pretraining: Language Modeling Beyond Document Boundaries [137.53145699439898]
In-Context Pretraining is a new approach where language models are pretrained on a sequence of related documents.
We introduce approximate algorithms for finding related documents with efficient nearest neighbor search.
We see notable improvements in tasks that require more complex contextual reasoning.
arXiv Detail & Related papers (2023-10-16T17:57:12Z)
- Unsupervised Dense Retrieval with Relevance-Aware Contrastive Pre-Training [81.3781338418574]
We propose relevance-aware contrastive learning.
We consistently improve the SOTA unsupervised Contriever model on the BEIR and open-domain QA retrieval benchmarks.
Our method can not only beat BM25 after further pre-training on the target corpus but also serves as a good few-shot learner.
arXiv Detail & Related papers (2023-06-05T18:20:27Z)
- Bridging the Training-Inference Gap for Dense Phrase Retrieval [104.4836127502683]
Building dense retrievers requires a series of standard procedures, including training and validating neural models.
In this paper, we explore how the gap between training and inference in dense retrieval can be reduced.
We propose an efficient way of validating dense retrievers using a small subset of the entire corpus.
arXiv Detail & Related papers (2022-10-25T00:53:06Z)
- Hyperlink-induced Pre-training for Passage Retrieval in Open-domain Question Answering [53.381467950545606]
HyperLink-induced Pre-training (HLP) is a method to pre-train the dense retriever with the text relevance induced by hyperlink-based topology within Web documents.
We demonstrate that the hyperlink-based structures of dual-link and co-mention can provide effective relevance signals for large-scale pre-training.
arXiv Detail & Related papers (2022-03-14T09:09:49Z)
- Pre-training Is (Almost) All You Need: An Application to Commonsense Reasoning [61.32992639292889]
Fine-tuning of pre-trained transformer models has become the standard approach for solving common NLP tasks.
We introduce a new scoring method that casts a plausibility ranking task in a full-text format.
We show that our method provides a much more stable training phase across random restarts.
arXiv Detail & Related papers (2020-04-29T10:54:40Z)