Back to Basics: A Simple Recipe for Improving Out-of-Domain Retrieval in Dense Encoders
- URL: http://arxiv.org/abs/2311.09765v1
- Date: Thu, 16 Nov 2023 10:42:58 GMT
- Title: Back to Basics: A Simple Recipe for Improving Out-of-Domain Retrieval in Dense Encoders
- Authors: Hyunji Lee, Luca Soldaini, Arman Cohan, Minjoon Seo, Kyle Lo
- Abstract summary: We study whether training procedures can be improved to yield better generalization capabilities in the resulting models.
We recommend a simple recipe for training dense encoders: Train on MSMARCO with parameter-efficient methods, such as LoRA, and opt for using in-batch negatives unless given well-constructed hard negatives.
- Score: 63.28408887247742
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prevailing research practice today often relies on training dense retrievers
on existing large datasets such as MSMARCO and then experimenting with ways to
improve zero-shot generalization capabilities to unseen domains. While prior
work has tackled this challenge through resource-intensive steps such as data
augmentation, architectural modifications, increasing model size, or even
further base model pretraining, comparatively little investigation has examined
whether the training procedures themselves can be improved to yield better
generalization capabilities in the resulting models. In this work, we recommend
a simple recipe for training dense encoders: Train on MSMARCO with
parameter-efficient methods, such as LoRA, and opt for using in-batch negatives
unless given well-constructed hard negatives. We validate these recommendations
using the BEIR benchmark and find results are persistent across choice of dense
encoder and base model size and are complementary to other resource-intensive
strategies for out-of-domain generalization such as architectural modifications
or additional pretraining. We hope that this thorough and impartial study
around various training techniques, which augments other resource-intensive
methods, offers practical insights for developing a dense retrieval model that
effectively generalizes, even when trained on a single dataset.
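As a rough illustration of this recipe, the sketch below shows parameter-efficient (LoRA) fine-tuning of a dense encoder with an in-batch-negatives contrastive loss. This is a minimal sketch, not the authors' released code: the bert-base-uncased base model, the Hugging Face peft library, the LoRA hyperparameters, the mean pooling, and the temperature are all illustrative assumptions.

```python
# Minimal sketch of the recommended recipe (not the authors' released code):
# LoRA fine-tuning of a dense encoder with an in-batch-negatives loss.
# Base model, LoRA settings, pooling, and temperature are illustrative assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer
from peft import LoraConfig, get_peft_model

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

# Only the low-rank adapter matrices are trained; the base encoder stays frozen.
encoder = get_peft_model(
    encoder,
    LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
               target_modules=["query", "value"]),
)

def embed(texts):
    """Mean-pool token embeddings into one vector per text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state            # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # (B, T, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # (B, H)

def in_batch_negatives_loss(queries, positives, temperature=0.05):
    """Each query's positive sits on the diagonal of the score matrix;
    every other passage in the batch serves as a free negative."""
    q = F.normalize(embed(queries), dim=-1)
    p = F.normalize(embed(positives), dim=-1)
    scores = q @ p.T / temperature                  # (B, B) similarity matrix
    labels = torch.arange(q.size(0))                # gold index = own positive
    return F.cross_entropy(scores, labels)

optimizer = torch.optim.AdamW(
    [p for p in encoder.parameters() if p.requires_grad], lr=1e-4)

# One illustrative training step on a toy MSMARCO-style batch of
# (query, relevant passage) pairs.
loss = in_batch_negatives_loss(
    ["what is a dense retriever",
     "how do in-batch negatives work"],
    ["A dense retriever encodes queries and passages into vectors and ranks by similarity.",
     "In-batch negatives reuse the other positives in a training batch as negatives."],
)
loss.backward()
optimizer.step()
```

The in-batch-negatives choice is visible in the (B, B) score matrix: only the diagonal query-passage pairs are labeled relevant, so every other passage in the batch acts as a negative at no extra cost, avoiding reliance on possibly poorly constructed hard negatives.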
Related papers
- Dynamic Loss-Based Sample Reweighting for Improved Large Language Model Pretraining [55.262510814326035]
Existing reweighting strategies primarily focus on group-level data importance.
We introduce novel algorithms for dynamic, instance-level data reweighting.
Our framework allows us to devise reweighting strategies deprioritizing redundant or uninformative data.
arXiv Detail & Related papers (2025-02-10T17:57:15Z)
- COBRA: COmBinatorial Retrieval Augmentation for Few-Shot Learning [5.840239260337972]
We propose COBRA (COmBinatorial Retrieval Augmentation), which employs an alternative CMI measure that considers both diversity and similarity to a target dataset.
COBRA introduces negligible computational overhead to the cost of retrieval while providing significant gains in downstream model performance.
arXiv Detail & Related papers (2024-12-23T16:10:07Z)
- Take the Bull by the Horns: Hard Sample-Reweighted Continual Training Improves LLM Generalization [165.98557106089777]
A key challenge is to enhance the capabilities of large language models (LLMs) amid a looming shortage of high-quality training data.
Our study starts from an empirical strategy for the light continual training of LLMs using their original pre-training data sets.
We then formalize this strategy into a principled framework of Instance-Reweighted Distributionally Robust Optimization.
arXiv Detail & Related papers (2024-02-22T04:10:57Z)
- Noisy Self-Training with Synthetic Queries for Dense Retrieval [49.49928764695172]
We introduce a novel noisy self-training framework combined with synthetic queries.
Experimental results show that our method improves consistently over existing methods.
Our method is data efficient and outperforms competitive baselines.
arXiv Detail & Related papers (2023-11-27T06:19:50Z)
- Universal Domain Adaptation from Foundation Models: A Baseline Study [58.51162198585434]
We make empirical studies of state-of-the-art UniDA methods using foundation models.
We introduce CLIP distillation, a parameter-free method specifically designed to distill target knowledge from CLIP models.
Although simple, our method outperforms previous approaches in most benchmark tasks.
arXiv Detail & Related papers (2023-05-18T16:28:29Z)
- Sample-based Regularization: A Transfer Learning Strategy Toward Better Generalization [8.432864879027724]
Training a deep neural network with a small amount of data is a challenging problem.
One practical difficulty we often face is collecting enough samples.
By using a source model trained on a large-scale dataset, the target model can alleviate the overfitting that arises from the lack of training data.
arXiv Detail & Related papers (2020-07-10T06:02:05Z)
- Generative Data Augmentation for Commonsense Reasoning [75.26876609249197]
G-DAUG^C is a novel generative data augmentation method that aims to achieve more accurate and robust learning in the low-resource setting.
G-DAUG^C consistently outperforms existing data augmentation methods based on back-translation.
Our analysis demonstrates that G-DAUG^C produces a diverse set of fluent training examples, and that its selection and training approaches are important for performance.
arXiv Detail & Related papers (2020-04-24T06:12:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.