Noisy Self-Training with Synthetic Queries for Dense Retrieval
- URL: http://arxiv.org/abs/2311.15563v1
- Date: Mon, 27 Nov 2023 06:19:50 GMT
- Title: Noisy Self-Training with Synthetic Queries for Dense Retrieval
- Authors: Fan Jiang, Tom Drummond, Trevor Cohn
- Abstract summary: We introduce a novel noisy self-training framework combined with synthetic queries.
Experimental results show that our method improves consistently over existing methods.
Our method is data efficient and outperforms competitive baselines.
- Score: 49.49928764695172
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although existing neural retrieval models reveal promising results when
training data is abundant and the performance keeps improving as training data
increases, collecting high-quality annotated data is prohibitively costly. To
this end, we introduce a novel noisy self-training framework combined with
synthetic queries, showing that neural retrievers can be improved in a
self-evolution manner with no reliance on any external models. Experimental
results show that our method improves consistently over existing methods on
both general-domain (e.g., MS-MARCO) and out-of-domain (i.e., BEIR) retrieval
benchmarks. Extra analysis on low-resource settings reveals that our method is
data efficient and outperforms competitive baselines, with as little as 30% of
labelled training data. Further extending the framework for reranker training
demonstrates that the proposed method is general and yields additional gains on
tasks of diverse domains.\footnote{Source code is available at
\url{https://github.com/Fantabulous-J/Self-Training-DPR}}
Related papers
- DUET: Optimizing Training Data Mixtures via Feedback from Unseen Evaluation Tasks [40.91931801667421]
This paper presents a novel global-to-local algorithm called DUET that can exploit the feedback loop by interleaving a data selection method with Bayesian optimization.
As a result, DUET can efficiently refine the training data mixture from a pool of data domains to maximize the model's performance on the unseen evaluation task.
arXiv Detail & Related papers (2025-02-01T01:52:32Z) - Federated Unlearning Model Recovery in Data with Skewed Label Distributions [10.236494861079779]
This paper proposes a recovery method of federated unlearning with skewed label distributions.
We first adopt a strategy that incorporates oversampling with deep learning to supplement the skewed class data.
Then, a density-based denoising method is applied to remove noise from the generated data.
All the remaining clients leverage the enhanced local datasets and engage in iterative training to effectively restore the performance of the unlearning model.
arXiv Detail & Related papers (2024-12-18T03:25:11Z) - Multi-Armed Bandit Approach for Optimizing Training on Synthetic Data [7.603659241572307]
We propose a novel UCB-based training procedure combined with a dynamic usability metric.
Our proposed metric integrates low-level and high-level information from synthetic images and their corresponding real and synthetic datasets.
We show that our metric is an effective way to rank synthetic images based on their usability.
arXiv Detail & Related papers (2024-12-06T23:36:36Z) - Learning with Noisy Foundation Models [95.50968225050012]
This paper is the first work to comprehensively understand and analyze the nature of noise in pre-training datasets.
We propose a tuning method (NMTune) to affine the feature space to mitigate the malignant effect of noise and improve generalization.
arXiv Detail & Related papers (2024-03-11T16:22:41Z) - Take the Bull by the Horns: Hard Sample-Reweighted Continual Training
Improves LLM Generalization [165.98557106089777]
A key challenge is to enhance the capabilities of large language models (LLMs) amid a looming shortage of high-quality training data.
Our study starts from an empirical strategy for the light continual training of LLMs using their original pre-training data sets.
We then formalize this strategy into a principled framework of Instance-Reweighted Distributionally Robust Optimization.
arXiv Detail & Related papers (2024-02-22T04:10:57Z) - Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation [63.180725016463974]
Cross-modal retrieval relies on well-matched large-scale datasets that are laborious in practice.
We introduce a novel noisy correspondence learning framework, namely textbfSelf-textbfReinforcing textbfErrors textbfMitigation (SREM)
arXiv Detail & Related papers (2023-12-27T09:03:43Z) - Back to Basics: A Simple Recipe for Improving Out-of-Domain Retrieval in
Dense Encoders [63.28408887247742]
We study whether training procedures can be improved to yield better generalization capabilities in the resulting models.
We recommend a simple recipe for training dense encoders: Train on MSMARCO with parameter-efficient methods, such as LoRA, and opt for using in-batch negatives unless given well-constructed hard negatives.
arXiv Detail & Related papers (2023-11-16T10:42:58Z) - Boosting Differentiable Causal Discovery via Adaptive Sample Reweighting [62.23057729112182]
Differentiable score-based causal discovery methods learn a directed acyclic graph from observational data.
We propose a model-agnostic framework to boost causal discovery performance by dynamically learning the adaptive weights for the Reweighted Score function, ReScore.
arXiv Detail & Related papers (2023-03-06T14:49:59Z) - Learning Fast Sample Re-weighting Without Reward Data [41.92662851886547]
This paper presents a novel learning-based fast sample re-weighting (FSR) method that does not require additional reward data.
Our experiments show the proposed method achieves competitive results compared to state of the arts on label noise and long-tailed recognition.
arXiv Detail & Related papers (2021-09-07T17:30:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.