AugTriever: Unsupervised Dense Retrieval and Domain Adaptation by Scalable Data Augmentation
- URL: http://arxiv.org/abs/2212.08841v4
- Date: Wed, 30 Oct 2024 02:36:38 GMT
- Title: AugTriever: Unsupervised Dense Retrieval and Domain Adaptation by Scalable Data Augmentation
- Authors: Rui Meng, Ye Liu, Semih Yavuz, Divyansh Agarwal, Lifu Tu, Ning Yu, Jianguo Zhang, Meghana Bhat, Yingbo Zhou,
- Abstract summary: We propose two approaches that enable annotation-free and scalable training by creating pseudo query-document pairs.
The query extraction method involves selecting salient spans from the original document to generate pseudo queries.
The transferred query generation method utilizes generation models trained for other NLP tasks, such as summarization, to produce pseudo queries.
- Score: 44.93777271276723
- License:
- Abstract: Dense retrievers have made significant strides in text retrieval and open-domain question answering. However, most of these achievements have relied heavily on extensive human-annotated supervision. In this study, we aim to develop unsupervised methods for improving dense retrieval models. We propose two approaches that enable annotation-free and scalable training by creating pseudo query-document pairs: query extraction and transferred query generation. The query extraction method involves selecting salient spans from the original document to generate pseudo queries. On the other hand, the transferred query generation method utilizes generation models trained for other NLP tasks, such as summarization, to produce pseudo queries. Through extensive experimentation, we demonstrate that models trained using these augmentation methods can achieve comparable, if not better, performance than multiple strong dense baselines. Moreover, combining these strategies leads to further improvements, resulting in superior performance of unsupervised dense retrieval, unsupervised domain adaptation and supervised finetuning, benchmarked on both BEIR and ODQA datasets. Code and datasets are publicly available at https://github.com/salesforce/AugTriever.
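The query extraction idea in the abstract (selecting a salient span of a document and using it as a pseudo query for that same document) can be illustrated with a minimal sketch. This is not the authors' implementation: the word-frequency salience score, the stopword list, and the names `extract_pseudo_query` and `build_pairs` are all assumptions made for illustration, and the actual extraction strategies are in the linked repository.

```python
import re
from collections import Counter

# Hypothetical, deliberately small stopword list (an assumption for this sketch).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "on", "is", "are", "into"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def extract_pseudo_query(document, span_len=4):
    """Return the span_len-word window with the highest salience score.

    Salience here is simply the summed document frequency of the
    non-stopword tokens in the window (an assumed heuristic).
    """
    words = document.split()
    if len(words) <= span_len:
        return document.strip()

    counts = Counter(t for t in tokenize(document) if t not in STOPWORDS)

    def score(window):
        # Stopwords never enter `counts`, so they contribute zero.
        return sum(counts[t] for w in window for t in tokenize(w))

    best_start = max(
        range(len(words) - span_len + 1),
        key=lambda i: score(words[i:i + span_len]),
    )
    return " ".join(words[best_start:best_start + span_len])


def build_pairs(documents, span_len=4):
    """Create (pseudo_query, document) pairs without any human annotation."""
    return [(extract_pseudo_query(d, span_len), d) for d in documents]
```

Pairs produced this way could then serve as positives for training a dense retriever; that training step is outside the scope of this sketch.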
Related papers
- Learning More Effective Representations for Dense Retrieval through Deliberate Thinking Before Search [65.53881294642451]
Deliberate Thinking based Dense Retriever (DEBATER)
DEBATER enhances recent dense retrievers by enabling them to learn more effective document representations through a step-by-step thinking process.
Experimental results show that DEBATER significantly outperforms existing methods across several retrieval benchmarks.
arXiv Detail & Related papers (2025-02-18T15:56:34Z)
- Reinforced Information Retrieval [35.0424269986952]
We present Reinforced-IR, a novel approach that jointly adapts a pre-trained retriever and generator for precise cross-domain retrieval.
A key innovation of Reinforced-IR is its Self-Boosting framework, which enables retriever and generator to learn from each other's feedback.
In our experiment, Reinforced-IR outperforms existing domain adaptation methods by a large margin, leading to substantial improvements in retrieval quality across a wide range of application scenarios.
arXiv Detail & Related papers (2025-02-17T08:52:39Z)
- Chain-of-Retrieval Augmented Generation [72.06205327186069]
This paper introduces an approach for training o1-like RAG models that retrieve and reason over relevant information step by step before generating the final answer.
Our proposed method, CoRAG, allows the model to dynamically reformulate the query based on the evolving state.
arXiv Detail & Related papers (2025-01-24T09:12:52Z)
- Unsupervised Query Routing for Retrieval Augmented Generation [64.47987041500966]
We introduce a novel unsupervised method that constructs the "upper-bound" response to evaluate the quality of retrieval-augmented responses.
This evaluation enables the decision of the most suitable search engine for a given query.
By eliminating manual annotations, our approach can automatically process large-scale real user queries and create training data.
arXiv Detail & Related papers (2025-01-14T02:27:06Z)
- MBA-RAG: a Bandit Approach for Adaptive Retrieval-Augmented Generation through Question Complexity [30.346398341996476]
We propose a reinforcement learning-based framework that dynamically selects the most suitable retrieval strategy based on query complexity.
Our method achieves new state of the art results on multiple single-hop and multi-hop datasets while reducing retrieval costs.
arXiv Detail & Related papers (2024-12-02T14:55:02Z)
- Corrective Retrieval Augmented Generation [36.04062963574603]
Retrieval-augmented generation (RAG) relies heavily on relevance of retrieved documents, raising concerns about how the model behaves if retrieval goes wrong.
We propose the Corrective Retrieval Augmented Generation (CRAG) to improve the robustness of generation.
CRAG is plug-and-play and can be seamlessly coupled with various RAG-based approaches.
arXiv Detail & Related papers (2024-01-29T04:36:39Z)
- Noisy Self-Training with Synthetic Queries for Dense Retrieval [49.49928764695172]
We introduce a novel noisy self-training framework combined with synthetic queries.
Experimental results show that our method improves consistently over existing methods.
Our method is data efficient and outperforms competitive baselines.
arXiv Detail & Related papers (2023-11-27T06:19:50Z)
- DORE: Document Ordered Relation Extraction based on Generative Framework [56.537386636819626]
This paper investigates the root cause of the underwhelming performance of the existing generative DocRE models.
We propose to generate a symbolic and ordered sequence from the relation matrix, which is deterministic and easier for the model to learn.
Experimental results on four datasets show that our proposed method can improve the performance of the generative DocRE models.
arXiv Detail & Related papers (2022-10-28T11:18:10Z)
- GQE-PRF: Generative Query Expansion with Pseudo-Relevance Feedback [8.142861977776256]
We propose a novel approach which effectively integrates text generation models into PRF-based query expansion.
Our approach generates augmented query terms via neural text generation models conditioned on both the initial query and pseudo-relevance feedback.
We evaluate the performance of our approach on information retrieval tasks using two benchmark datasets.
arXiv Detail & Related papers (2021-08-13T01:09:02Z)