Domain-matched Pre-training Tasks for Dense Retrieval
- URL: http://arxiv.org/abs/2107.13602v1
- Date: Wed, 28 Jul 2021 19:13:00 GMT
- Title: Domain-matched Pre-training Tasks for Dense Retrieval
- Authors: Barlas Oğuz, Kushal Lakhotia, Anchit Gupta, Patrick Lewis,
Vladimir Karpukhin, Aleksandra Piktus, Xilun Chen, Sebastian Riedel, Wen-tau
Yih, Sonal Gupta, Yashar Mehdad
- Abstract summary: Pre-training on larger datasets with ever increasing model size is now a proven recipe for increased performance across almost all NLP tasks.
We show that, with the right pre-training setup, this barrier can be overcome.
We demonstrate this by pre-training large bi-encoder models on 1) a recently released set of 65 million synthetically generated questions, and 2) 200 million post-comment pairs from a preexisting dataset of Reddit conversations.
- Score: 68.07140087626637
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-training on larger datasets with ever increasing model size is now a
proven recipe for increased performance across almost all NLP tasks. A notable
exception is information retrieval, where additional pre-training has so far
failed to produce convincing results. We show that, with the right pre-training
setup, this barrier can be overcome. We demonstrate this by pre-training large
bi-encoder models on 1) a recently released set of 65 million synthetically
generated questions, and 2) 200 million post-comment pairs from a preexisting
dataset of Reddit conversations made available by pushshift.io. We evaluate on
a set of information retrieval and dialogue retrieval benchmarks, showing
substantial improvements over supervised baselines.
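To make the setup concrete, below is a minimal PyTorch sketch of the kind of bi-encoder training the abstract describes: two independent encoders score (question, passage) or (post, comment) pairs by dot product and are trained with in-batch negatives. The model name, sequence length, and helper functions are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Bi-encoder: one encoder for queries/posts, one for passages/comments.
# "bert-base-uncased" is a placeholder backbone, not the paper's exact model.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
query_encoder = AutoModel.from_pretrained("bert-base-uncased")
passage_encoder = AutoModel.from_pretrained("bert-base-uncased")

def encode(encoder, texts):
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=256, return_tensors="pt")
    # Use the [CLS] token representation as the fixed-size embedding.
    return encoder(**batch).last_hidden_state[:, 0]

def in_batch_negative_loss(queries, passages):
    """Contrastive loss where each query's positive passage sits at the same
    batch index and all other passages in the batch act as negatives."""
    q = encode(query_encoder, queries)     # (B, H)
    p = encode(passage_encoder, passages)  # (B, H)
    scores = q @ p.T                       # dot-product similarity, (B, B)
    labels = torch.arange(scores.size(0))  # positives lie on the diagonal
    return F.cross_entropy(scores, labels)

# Toy pre-training pair, e.g. a synthetic (question, passage) pair or a
# Reddit (post, comment) pair as in the abstract.
loss = in_batch_negative_loss(
    ["who wrote the origin of species?"],
    ["On the Origin of Species was written by Charles Darwin in 1859."],
)
loss.backward()
```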
Related papers
- Divide and Conquer: Hybrid Pre-training for Person Search [40.13016375392472]
We propose a hybrid pre-training framework specifically designed for person search using sub-task data only.
Our model achieves significant improvements across diverse protocols, covering the person search method, fine-tuning data, pre-training data, and model backbone.
Our code and pre-trained models are released for plug-and-play usage to the person search community.
arXiv Detail & Related papers (2023-12-13T08:33:50Z)
- Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z)
- Pushing the Limits of Pre-training for Time Series Forecasting in the CloudOps Domain [54.67888148566323]
We introduce three large-scale time series forecasting datasets from the cloud operations domain.
We show that the resulting pre-trained model is a strong zero-shot baseline and benefits from further scaling of both model and dataset size.
Accompanying these datasets and results is a suite of comprehensive benchmark results comparing classical and deep learning baselines to our pre-trained method.
arXiv Detail & Related papers (2023-10-08T08:09:51Z)
- When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale [12.94829977468838]
Large volumes of text data have contributed significantly to the development of large language models.
To date, efforts to prune datasets down to a higher quality subset have relied on hand-crafted heuristics encoded as rule-based filters.
We take a wider view and explore scalable estimates of data quality that can be used to measure the quality of pretraining data.
arXiv Detail & Related papers (2023-09-08T19:34:05Z)
- Unsupervised Dense Retrieval with Relevance-Aware Contrastive Pre-Training [81.3781338418574]
We propose relevance-aware contrastive learning.
We consistently improve the SOTA unsupervised Contriever model on the BEIR and open-domain QA retrieval benchmarks.
Our method can not only beat BM25 after further pre-training on the target corpus but also serves as a good few-shot learner.
arXiv Detail & Related papers (2023-06-05T18:20:27Z)
- AD-PT: Autonomous Driving Pre-Training with Large-scale Point Cloud Dataset [25.935496432142976]
It is a long-term vision of the Autonomous Driving (AD) community that perception models can learn from a large-scale point cloud dataset.
We formulate the point-cloud pre-training task as a semi-supervised problem, which leverages the few-shot labeled and massive unlabeled point-cloud data.
We achieve significant performance gains on a series of downstream perception benchmarks, including nuScenes and KITTI, under different baseline models.
arXiv Detail & Related papers (2023-06-01T12:32:52Z)
- BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition [126.5605160882849]
We find that the combination of pre-training, self-training and scaling up model size greatly increases data efficiency.
We report on the universal benefits gained from using big pre-trained and self-trained models for a large set of downstream tasks.
arXiv Detail & Related papers (2021-09-27T17:59:19Z)
- Pretraining Representations for Data-Efficient Reinforcement Learning [12.43475487724972]
We use unlabeled data to pretrain an encoder which is then finetuned on a small amount of task-specific data.
When limited to 100k steps of interaction on Atari games, our approach significantly surpasses prior work.
Our approach shows particular promise when combined with larger models as well as more diverse, task-aligned observational data.
arXiv Detail & Related papers (2021-06-09T04:14:27Z)
- Don't Stop Pretraining: Adapt Language Models to Domains and Tasks [81.99843216550306]
We present a study across four domains (biomedical and computer science publications, news, and reviews) and eight classification tasks.
A second phase of pretraining in-domain (domain-adaptive pretraining) leads to performance gains.
Adapting to the task's unlabeled data (task-adaptive pretraining) improves performance even after domain-adaptive pretraining; a rough sketch of this continued-pretraining setup follows the list.
arXiv Detail & Related papers (2020-04-23T04:21:19Z)
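As a rough illustration of the domain-adaptive pretraining described in the last entry above (Don't Stop Pretraining), the sketch below continues masked-language-model pretraining on an in-domain corpus with Hugging Face Transformers. The backbone, the file name domain_corpus.txt, and the hyperparameters are placeholder assumptions, not the original paper's setup.

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# Hypothetical in-domain corpus: one raw text document per line.
corpus = load_dataset("text", data_files={"train": "domain_corpus.txt"})["train"]
corpus = corpus.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=256),
    batched=True, remove_columns=["text"],
)

# Standard masked-language-model objective with 15% masking.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="dapt-roberta",
                         per_device_train_batch_size=16,
                         num_train_epochs=1,
                         learning_rate=1e-4)

# Second phase of pretraining on in-domain text before task fine-tuning.
Trainer(model=model, args=args, train_dataset=corpus,
        data_collator=collator).train()
```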
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.