DUQGen: Effective Unsupervised Domain Adaptation of Neural Rankers by Diversifying Synthetic Query Generation
- URL: http://arxiv.org/abs/2404.02489v1
- Date: Wed, 3 Apr 2024 05:50:42 GMT
- Title: DUQGen: Effective Unsupervised Domain Adaptation of Neural Rankers by Diversifying Synthetic Query Generation
- Authors: Ramraj Chandradevan, Kaustubh D. Dhole, Eugene Agichtein,
- Abstract summary: State-of-the-art neural rankers pre-trained on large task-specific training data such as MS-MARCO, have been shown to exhibit strong performance on various ranking tasks without domain adaptation, also called zero-shot.
We propose a new approach to unsupervised domain adaptation for ranking, DUQGen, which addresses a critical gap in prior literature.
- Score: 8.661419320202787
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: State-of-the-art neural rankers pre-trained on large task-specific training data such as MS-MARCO, have been shown to exhibit strong performance on various ranking tasks without domain adaptation, also called zero-shot. However, zero-shot neural ranking may be sub-optimal, as it does not take advantage of the target domain information. Unfortunately, acquiring sufficiently large and high quality target training data to improve a modern neural ranker can be costly and time-consuming. To address this problem, we propose a new approach to unsupervised domain adaptation for ranking, DUQGen, which addresses a critical gap in prior literature, namely how to automatically generate both effective and diverse synthetic training data to fine tune a modern neural ranker for a new domain. Specifically, DUQGen produces a more effective representation of the target domain by identifying clusters of similar documents; and generates a more diverse training dataset by probabilistic sampling over the resulting document clusters. Our extensive experiments, over the standard BEIR collection, demonstrate that DUQGen consistently outperforms all zero-shot baselines and substantially outperforms the SOTA baselines on 16 out of 18 datasets, for an average of 4% relative improvement across all datasets. We complement our results with a thorough analysis for more in-depth understanding of the proposed method's performance and to identify promising areas for further improvements.
Related papers
- Noisy Self-Training with Synthetic Queries for Dense Retrieval [49.49928764695172]
We introduce a novel noisy self-training framework combined with synthetic queries.
Experimental results show that our method improves consistently over existing methods.
Our method is data efficient and outperforms competitive baselines.
arXiv Detail & Related papers (2023-11-27T06:19:50Z) - Activate and Reject: Towards Safe Domain Generalization under Category
Shift [71.95548187205736]
We study a practical problem of Domain Generalization under Category Shift (DGCS)
It aims to simultaneously detect unknown-class samples and classify known-class samples in the target domains.
Compared to prior DG works, we face two new challenges: 1) how to learn the concept of unknown'' during training with only source known-class samples, and 2) how to adapt the source-trained model to unseen environments.
arXiv Detail & Related papers (2023-10-07T07:53:12Z) - CNN Feature Map Augmentation for Single-Source Domain Generalization [6.053629733936548]
Domain Generalization (DG) has gained significant traction during the past few years.
The goal in DG is to produce models which continue to perform well when presented with data distributions different from the ones available during training.
We propose an alternative regularization technique for convolutional neural network architectures in the single-source DG image classification setting.
arXiv Detail & Related papers (2023-05-26T08:48:17Z) - Enhancing Few-shot NER with Prompt Ordering based Data Augmentation [59.69108119752584]
We propose a Prompt Ordering based Data Augmentation (PODA) method to improve the training of unified autoregressive generation frameworks.
Experimental results on three public NER datasets and further analyses demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2023-05-19T16:25:43Z) - Model-based Transfer Learning for Automatic Optical Inspection based on
domain discrepancy [9.039797705929363]
This research applies model-based TL via domain similarity to improve the overall performance and data augmentation in both target and source domains.
Our research suggests increases in the F1 score and the PR curve up to 20% compared with TL using benchmark datasets.
arXiv Detail & Related papers (2023-01-14T11:32:39Z) - Data Augmentation techniques in time series domain: A survey and
taxonomy [0.20971479389679332]
Deep neural networks used to work with time series heavily depend on the size and consistency of the datasets used in training.
This work systematically reviews the current state-of-the-art in the area to provide an overview of all available algorithms.
The ultimate aim of this study is to provide a summary of the evolution and performance of areas that produce better results to guide future researchers in this field.
arXiv Detail & Related papers (2022-06-25T17:09:00Z) - Deep face recognition with clustering based domain adaptation [57.29464116557734]
We propose a new clustering-based domain adaptation method designed for face recognition task in which the source and target domain do not share any classes.
Our method effectively learns the discriminative target feature by aligning the feature domain globally, and, at the meantime, distinguishing the target clusters locally.
arXiv Detail & Related papers (2022-05-27T12:29:11Z) - Unsupervised Domain Adaptive Learning via Synthetic Data for Person
Re-identification [101.1886788396803]
Person re-identification (re-ID) has gained more and more attention due to its widespread applications in video surveillance.
Unfortunately, the mainstream deep learning methods still need a large quantity of labeled data to train models.
In this paper, we develop a data collector to automatically generate synthetic re-ID samples in a computer game, and construct a data labeler to simultaneously annotate them.
arXiv Detail & Related papers (2021-09-12T15:51:41Z) - Boosting the Generalization Capability in Cross-Domain Few-shot Learning
via Noise-enhanced Supervised Autoencoder [23.860842627883187]
We teach the model to capture broader variations of the feature distributions with a novel noise-enhanced supervised autoencoder (NSAE)
NSAE trains the model by jointly reconstructing inputs and predicting the labels of inputs as well as their reconstructed pairs.
We also take advantage of NSAE structure and propose a two-step fine-tuning procedure that achieves better adaption and improves classification performance in the target domain.
arXiv Detail & Related papers (2021-08-11T04:45:56Z) - Ranking Creative Language Characteristics in Small Data Scenarios [52.00161818003478]
We adapt the DirectRanker to provide a new deep model for ranking creative language with small data.
Our experiments with sparse training data show that while the performance of standard neural ranking approaches collapses with small datasets, DirectRanker remains effective.
arXiv Detail & Related papers (2020-10-23T18:57:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.