Back to Basics: A Simple Recipe for Improving Out-of-Domain Retrieval in
Dense Encoders
- URL: http://arxiv.org/abs/2311.09765v1
- Date: Thu, 16 Nov 2023 10:42:58 GMT
- Title: Back to Basics: A Simple Recipe for Improving Out-of-Domain Retrieval in
Dense Encoders
- Authors: Hyunji Lee, Luca Soldaini, Arman Cohan, Minjoon Seo, Kyle Lo
- Abstract summary: We study whether training procedures can be improved to yield better generalization capabilities in the resulting models.
We recommend a simple recipe for training dense encoders: Train on MSMARCO with parameter-efficient methods, such as LoRA, and opt for using in-batch negatives unless given well-constructed hard negatives.
- Score: 63.28408887247742
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prevailing research practice today often relies on training dense retrievers
on existing large datasets such as MSMARCO and then experimenting with ways to
improve zero-shot generalization capabilities to unseen domains. While prior
work has tackled this challenge through resource-intensive steps such as data
augmentation, architectural modifications, increasing model size, or even
further base model pretraining, comparatively little work has examined
whether the training procedures themselves can be improved to yield better
generalization capabilities in the resulting models. In this work, we recommend
a simple recipe for training dense encoders: Train on MSMARCO with
parameter-efficient methods, such as LoRA, and opt for using in-batch negatives
unless given well-constructed hard negatives. We validate these recommendations
using the BEIR benchmark and find that the results hold consistently across the choice of dense
encoder and base model size and are complementary to other resource-intensive
strategies for out-of-domain generalization such as architectural modifications
or additional pretraining. We hope that this thorough and impartial study
around various training techniques, which augments other resource-intensive
methods, offers practical insights for developing a dense retrieval model that
effectively generalizes, even when trained on a single dataset.
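As a rough illustration of the recommended recipe (a sketch under stated assumptions, not the authors' released code), the snippet below fine-tunes a dense encoder with LoRA adapters and an in-batch-negatives contrastive loss on MSMARCO-style (query, positive passage) pairs. The base model name, LoRA hyperparameters, pooling strategy, and toy batch are illustrative assumptions.

```python
# Minimal sketch (assumed setup, not the paper's released code): LoRA fine-tuning
# of a dense encoder with an in-batch-negatives contrastive loss.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "bert-base-uncased"  # assumed base encoder; the paper studies several
tokenizer = AutoTokenizer.from_pretrained(model_name)
encoder = AutoModel.from_pretrained(model_name)

# Parameter-efficient tuning: only the injected low-rank adapter matrices are trained.
lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.1,
                      target_modules=["query", "value"])  # illustrative hyperparameters
encoder = get_peft_model(encoder, lora_cfg)

def embed(texts):
    """Mean-pool token embeddings into one dense vector per text."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=256, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state           # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # (B, T, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # (B, H)

def in_batch_negatives_loss(queries, positives, temperature=0.05):
    """InfoNCE loss in which every other positive in the batch acts as a negative."""
    q = F.normalize(embed(queries), dim=-1)
    p = F.normalize(embed(positives), dim=-1)
    scores = q @ p.T / temperature     # (B, B) query-passage similarities
    labels = torch.arange(q.size(0))   # diagonal entries are the gold pairs
    return F.cross_entropy(scores, labels)

# One illustrative training step on a toy batch of (query, positive passage) pairs.
optimizer = torch.optim.AdamW(
    [p for p in encoder.parameters() if p.requires_grad], lr=1e-4)
optimizer.zero_grad()
loss = in_batch_negatives_loss(
    ["what is dense retrieval?", "how does lora work?"],
    ["Dense retrieval encodes queries and passages into a shared vector space.",
     "LoRA injects trainable low-rank matrices into a frozen pretrained model."])
loss.backward()
optimizer.step()
```

If well-constructed hard negatives are available, the same cross-entropy objective applies after appending query-to-hard-negative similarities as extra columns of the score matrix; otherwise, per the abstract, plain in-batch negatives are the safer default.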
Related papers
- Take the Bull by the Horns: Hard Sample-Reweighted Continual Training
Improves LLM Generalization [165.98557106089777]
A key challenge is to enhance the capabilities of large language models (LLMs) amid a looming shortage of high-quality training data.
Our study starts from an empirical strategy for the light continual training of LLMs using their original pre-training data sets.
We then formalize this strategy into a principled framework of Instance-Reweighted Distributionally Robust Optimization.
arXiv Detail & Related papers (2024-02-22T04:10:57Z)
- Noisy Self-Training with Synthetic Queries for Dense Retrieval [49.49928764695172]
We introduce a novel noisy self-training framework combined with synthetic queries.
Experimental results show that our method improves consistently over existing methods.
Our method is data efficient and outperforms competitive baselines.
arXiv Detail & Related papers (2023-11-27T06:19:50Z)
- Consistency Regularization for Generalizable Source-free Domain Adaptation [62.654883736925456]
Source-free domain adaptation (SFDA) aims to adapt a well-trained source model to an unlabelled target domain without accessing the source dataset.
Existing SFDA methods ONLY assess their adapted models on the target training set, neglecting the data from unseen but identically distributed testing sets.
We propose a consistency regularization framework to develop a more generalizable SFDA method.
arXiv Detail & Related papers (2023-08-03T07:45:53Z)
- Universal Domain Adaptation from Foundation Models: A Baseline Study [58.51162198585434]
We make empirical studies of state-of-the-art UniDA methods using foundation models.
We introduce CLIP distillation, a parameter-free method specifically designed to distill target knowledge from CLIP models.
Although simple, our method outperforms previous approaches in most benchmark tasks.
arXiv Detail & Related papers (2023-05-18T16:28:29Z)
- Exploiting All Samples in Low-Resource Sentence Classification: Early Stopping and Initialization Parameters [6.368871731116769]
In this study, we discuss how to exploit labeled samples without additional data or model redesigns.
We propose an integrated method that initializes the model via weight averaging and uses a non-validation stopping method so that all samples can be used for training.
Our results highlight the importance of the training strategy and suggest that the integrated method can be the first step in the low-resource setting.
arXiv Detail & Related papers (2021-11-12T22:31:47Z)
- Sample-based Regularization: A Transfer Learning Strategy Toward Better Generalization [8.432864879027724]
Training a deep neural network with a small amount of data is a challenging problem.
One of the practical difficulties we often face is collecting enough samples.
By using a source model trained on a large-scale dataset, the target model can alleviate the overfitting that originates from the lack of training data.
arXiv Detail & Related papers (2020-07-10T06:02:05Z)
- Generative Data Augmentation for Commonsense Reasoning [75.26876609249197]
G-DAUG^C is a novel generative data augmentation method that aims to achieve more accurate and robust learning in the low-resource setting.
G-DAUG^C consistently outperforms existing data augmentation methods based on back-translation.
Our analysis demonstrates that G-DAUG^C produces a diverse set of fluent training examples, and that its selection and training approaches are important for performance.
arXiv Detail & Related papers (2020-04-24T06:12:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.