Noisy Self-Training with Synthetic Queries for Dense Retrieval
- URL: http://arxiv.org/abs/2311.15563v1
- Date: Mon, 27 Nov 2023 06:19:50 GMT
- Title: Noisy Self-Training with Synthetic Queries for Dense Retrieval
- Authors: Fan Jiang, Tom Drummond, Trevor Cohn
- Abstract summary: We introduce a novel noisy self-training framework combined with synthetic queries.
Experimental results show that our method improves consistently over existing methods.
Our method is data efficient and outperforms competitive baselines.
- Score: 49.49928764695172
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although existing neural retrieval models reveal promising results when
training data is abundant and the performance keeps improving as training data
increases, collecting high-quality annotated data is prohibitively costly. To
this end, we introduce a novel noisy self-training framework combined with
synthetic queries, showing that neural retrievers can be improved in a
self-evolution manner with no reliance on any external models. Experimental
results show that our method improves consistently over existing methods on
both general-domain (e.g., MS-MARCO) and out-of-domain (i.e., BEIR) retrieval
benchmarks. Extra analysis on low-resource settings reveals that our method is
data efficient and outperforms competitive baselines, with as little as 30% of
labelled training data. Further extending the framework for reranker training
demonstrates that the proposed method is general and yields additional gains on
tasks of diverse domains. Source code is available at
https://github.com/Fantabulous-J/Self-Training-DPR
Related papers
- Weak Reward Model Transforms Generative Models into Robust Causal Event Extraction Systems [17.10762463903638]
We train evaluation models to approximate human evaluation, achieving high agreement.
We propose a weak-to-strong supervision method that uses a fraction of the annotated data to train an evaluation model.
arXiv Detail & Related papers (2024-06-26T10:48:14Z)
- DUQGen: Effective Unsupervised Domain Adaptation of Neural Rankers by Diversifying Synthetic Query Generation [8.661419320202787]
State-of-the-art neural rankers pre-trained on large task-specific training data, such as MS-MARCO, have been shown to exhibit strong performance on various ranking tasks without domain adaptation, a setting also called zero-shot.
We propose a new approach to unsupervised domain adaptation for ranking, DUQGen, which addresses a critical gap in prior literature.
arXiv Detail & Related papers (2024-04-03T05:50:42Z)
- Learning with Noisy Foundation Models [95.50968225050012]
This paper is the first work to comprehensively understand and analyze the nature of noise in pre-training datasets.
We propose a tuning method (NMTune) to affine the feature space to mitigate the malignant effect of noise and improve generalization.
arXiv Detail & Related papers (2024-03-11T16:22:41Z)
- Take the Bull by the Horns: Hard Sample-Reweighted Continual Training Improves LLM Generalization [165.98557106089777]
A key challenge is to enhance the capabilities of large language models (LLMs) amid a looming shortage of high-quality training data.
Our study starts from an empirical strategy for the light continual training of LLMs using their original pre-training data sets.
We then formalize this strategy into a principled framework of Instance-Reweighted Distributionally Robust Optimization.
arXiv Detail & Related papers (2024-02-22T04:10:57Z)
- Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation [63.180725016463974]
Cross-modal retrieval relies on well-matched large-scale datasets that are laborious in practice.
We introduce a novel noisy correspondence learning framework, namely Self-Reinforcing Errors Mitigation (SREM).
arXiv Detail & Related papers (2023-12-27T09:03:43Z)
- Back to Basics: A Simple Recipe for Improving Out-of-Domain Retrieval in Dense Encoders [63.28408887247742]
We study whether training procedures can be improved to yield better generalization capabilities in the resulting models.
We recommend a simple recipe for training dense encoders: Train on MSMARCO with parameter-efficient methods, such as LoRA, and opt for using in-batch negatives unless given well-constructed hard negatives.
arXiv Detail & Related papers (2023-11-16T10:42:58Z)
- Boosting Differentiable Causal Discovery via Adaptive Sample Reweighting [62.23057729112182]
Differentiable score-based causal discovery methods learn a directed acyclic graph from observational data.
We propose a model-agnostic framework to boost causal discovery performance by dynamically learning the adaptive weights for the Reweighted Score function, ReScore.
arXiv Detail & Related papers (2023-03-06T14:49:59Z)
- Learning Fast Sample Re-weighting Without Reward Data [41.92662851886547]
This paper presents a novel learning-based fast sample re-weighting (FSR) method that does not require additional reward data.
Our experiments show the proposed method achieves competitive results compared to the state of the art on label noise and long-tailed recognition.
arXiv Detail & Related papers (2021-09-07T17:30:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.