Back to Basics: A Simple Recipe for Improving Out-of-Domain Retrieval in Dense Encoders
- URL: http://arxiv.org/abs/2311.09765v1
- Date: Thu, 16 Nov 2023 10:42:58 GMT
- Title: Back to Basics: A Simple Recipe for Improving Out-of-Domain Retrieval in Dense Encoders
- Authors: Hyunji Lee, Luca Soldaini, Arman Cohan, Minjoon Seo, Kyle Lo
- Abstract summary: We study whether training procedures can be improved to yield better generalization capabilities in the resulting models.
We recommend a simple recipe for training dense encoders: Train on MSMARCO with parameter-efficient methods, such as LoRA, and opt for using in-batch negatives unless given well-constructed hard negatives.
- Score: 63.28408887247742
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prevailing research practice today often relies on training dense retrievers
on existing large datasets such as MSMARCO and then experimenting with ways to
improve zero-shot generalization capabilities to unseen domains. While prior
work has tackled this challenge through resource-intensive steps such as data
augmentation, architectural modifications, increasing model size, or even
further base model pretraining, comparatively little investigation has examined
whether the training procedures themselves can be improved to yield better
generalization capabilities in the resulting models. In this work, we recommend
a simple recipe for training dense encoders: Train on MSMARCO with
parameter-efficient methods, such as LoRA, and opt for using in-batch negatives
unless given well-constructed hard negatives. We validate these recommendations
using the BEIR benchmark and find results are persistent across choice of dense
encoder and base model size and are complementary to other resource-intensive
strategies for out-of-domain generalization such as architectural modifications
or additional pretraining. We hope that this thorough and impartial study
around various training techniques, which augments other resource-intensive
methods, offers practical insights for developing a dense retrieval model that
effectively generalizes, even when trained on a single dataset.
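As a rough illustration of this recipe, the sketch below shows parameter-efficient (LoRA) fine-tuning of a dense encoder with an in-batch-negatives contrastive loss. This is a minimal sketch, not the authors' released code: the bert-base-uncased base model, the Hugging Face peft library, the LoRA hyperparameters, the mean pooling, and the temperature are all illustrative assumptions.

```python
# Minimal sketch of the recommended recipe (not the authors' released code):
# LoRA fine-tuning of a dense encoder with an in-batch-negatives loss.
# Base model, LoRA settings, pooling, and temperature are illustrative assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer
from peft import LoraConfig, get_peft_model

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

# Only the low-rank adapter matrices are trained; the base encoder stays frozen.
encoder = get_peft_model(
    encoder,
    LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
               target_modules=["query", "value"]),
)

def embed(texts):
    """Mean-pool token embeddings into one vector per text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state            # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # (B, T, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # (B, H)

def in_batch_negatives_loss(queries, positives, temperature=0.05):
    """Each query's positive sits on the diagonal of the score matrix;
    every other passage in the batch serves as a free negative."""
    q = F.normalize(embed(queries), dim=-1)
    p = F.normalize(embed(positives), dim=-1)
    scores = q @ p.T / temperature                  # (B, B) similarity matrix
    labels = torch.arange(q.size(0))                # gold index = own positive
    return F.cross_entropy(scores, labels)

optimizer = torch.optim.AdamW(
    [p for p in encoder.parameters() if p.requires_grad], lr=1e-4)

# One illustrative training step on a toy MSMARCO-style batch of
# (query, relevant passage) pairs.
loss = in_batch_negatives_loss(
    ["what is a dense retriever",
     "how do in-batch negatives work"],
    ["A dense retriever encodes queries and passages into vectors and ranks by similarity.",
     "In-batch negatives reuse the other positives in a training batch as negatives."],
)
loss.backward()
optimizer.step()
```

The in-batch-negatives choice is visible in the (B, B) score matrix: only the diagonal query-passage pairs are labeled relevant, so every other passage in the batch acts as a negative at no extra cost, avoiding reliance on possibly poorly constructed hard negatives.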
Related papers
- Dynamic Loss-Based Sample Reweighting for Improved Large Language Model Pretraining [55.262510814326035]
Existing reweighting strategies primarily focus on group-level data importance.
We introduce novel algorithms for dynamic, instance-level data reweighting.
Our framework allows us to devise reweighting strategies deprioritizing redundant or uninformative data.
arXiv Detail & Related papers (2025-02-10T17:57:15Z)
- COBRA: COmBinatorial Retrieval Augmentation for Few-Shot Learning [5.840239260337972]
We propose COBRA (COmBinatorial Retrieval Augmentation), which employs an alternative CMI measure that considers both diversity and similarity to a target dataset.
COBRA introduces negligible computational overhead to the cost of retrieval while providing significant gains in downstream model performance.
arXiv Detail & Related papers (2024-12-23T16:10:07Z)
- Take the Bull by the Horns: Hard Sample-Reweighted Continual Training Improves LLM Generalization [165.98557106089777]
A key challenge is to enhance the capabilities of large language models (LLMs) amid a looming shortage of high-quality training data.
Our study starts from an empirical strategy for the light continual training of LLMs using their original pre-training data sets.
We then formalize this strategy into a principled framework of Instance-Reweighted Distributionally Robust Optimization.
arXiv Detail & Related papers (2024-02-22T04:10:57Z)
- Noisy Self-Training with Synthetic Queries for Dense Retrieval [49.49928764695172]
We introduce a novel noisy self-training framework combined with synthetic queries.
Experimental results show that our method improves consistently over existing methods.
Our method is data efficient and outperforms competitive baselines.
arXiv Detail & Related papers (2023-11-27T06:19:50Z)
- Universal Domain Adaptation from Foundation Models: A Baseline Study [58.51162198585434]
We make empirical studies of state-of-the-art UniDA methods using foundation models.
We introduce CLIP distillation, a parameter-free method specifically designed to distill target knowledge from CLIP models.
Although simple, our method outperforms previous approaches in most benchmark tasks.
arXiv Detail & Related papers (2023-05-18T16:28:29Z)
- Sample-based Regularization: A Transfer Learning Strategy Toward Better Generalization [8.432864879027724]
Training a deep neural network with a small amount of data is a challenging problem.
One practical difficulty we often face is collecting enough samples.
By using a source model trained on a large-scale dataset, the target model can alleviate the overfitting that arises from the lack of training data.
arXiv Detail & Related papers (2020-07-10T06:02:05Z)
- Generative Data Augmentation for Commonsense Reasoning [75.26876609249197]
G-DAUG^C is a novel generative data augmentation method that aims to achieve more accurate and robust learning in the low-resource setting.
G-DAUG^C consistently outperforms existing data augmentation methods based on back-translation.
Our analysis demonstrates that G-DAUG^C produces a diverse set of fluent training examples, and that its selection and training approaches are important for performance.
arXiv Detail & Related papers (2020-04-24T06:12:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.