Selecting which Dense Retriever to use for Zero-Shot Search
- URL: http://arxiv.org/abs/2309.09403v1
- Date: Mon, 18 Sep 2023 00:01:24 GMT
- Title: Selecting which Dense Retriever to use for Zero-Shot Search
- Authors: Ekaterina Khramtsova, Shengyao Zhuang, Mahsa Baktashmotlagh, Xi Wang,
Guido Zuccon
- Abstract summary: We propose a new problem of choosing which dense retrieval model to use when searching on a new collection for which no labels are available.
We show that methods inspired by recent work in unsupervised performance evaluation are not effective for choosing high-performing dense retrievers.
- Score: 34.04158960512326
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose the new problem of choosing which dense retrieval model to use
when searching on a new collection for which no labels are available, i.e. in a
zero-shot setting. Many dense retrieval models are readily available. Each
model, however, exhibits very different search effectiveness -- not just on the
test portion of the datasets on which its dense representations were learned
but, importantly, also across datasets whose data was not used to learn those
representations. This is because dense
retrievers typically require training on a large amount of labeled data to
achieve satisfactory search effectiveness in a specific dataset or domain.
Moreover, effectiveness gains obtained by dense retrievers on datasets for
which they observe labels during training do not necessarily generalise to
datasets not seen during training. Selecting a retriever without labels is,
however, a hard problem: through empirical experimentation we show that methods
inspired by recent work on unsupervised performance evaluation under domain
shift in computer vision and machine learning are not effective for choosing
high-performing dense retrievers in our setup.
The availability of reliable methods for the selection of dense retrieval
models in zero-shot settings that do not require the collection of labels for
evaluation would help streamline the widespread adoption of dense
retrieval. This is therefore an important new problem we believe the
information retrieval community should consider. Implementations of the methods,
along with raw result files and analysis scripts, are made publicly available at
https://www.github.com/anonymized.
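To make the problem setup concrete, below is a minimal, hypothetical sketch of unsupervised retriever selection on an unlabeled target collection: each candidate dense retriever is scored by the mean cosine similarity of its top-k retrieved documents for a sample of queries, and the highest-scoring retriever is selected. The scoring heuristic, the toy random-projection encoders, and all names (`make_toy_encoder`, `unsupervised_retriever_score`) are illustrative assumptions, not the methods evaluated in the paper.

```python
# Illustrative sketch (not the paper's method): unsupervised selection of a
# dense retriever for an unlabeled target collection. Each candidate retriever
# is scored by the mean cosine similarity of its top-k retrieved documents for
# a sample of queries; the retriever with the highest score is selected.
# Real systems would plug in actual query/document encoders; toy hashed
# bag-of-words encoders with random projections keep the example runnable.

import numpy as np


def make_toy_encoder(dim: int, seed: int):
    """Return a toy text encoder: hashed bag-of-words projected to `dim` dims."""
    proj = np.random.default_rng(seed).normal(size=(1024, dim))

    def encode(texts):
        vecs = np.zeros((len(texts), 1024))
        for i, text in enumerate(texts):
            for tok in text.lower().split():
                vecs[i, hash(tok) % 1024] += 1.0
        emb = vecs @ proj
        return emb / (np.linalg.norm(emb, axis=1, keepdims=True) + 1e-9)

    return encode


def unsupervised_retriever_score(encode, queries, corpus, k=3):
    """Mean top-k query-document cosine similarity on the unlabeled collection."""
    q, d = encode(queries), encode(corpus)
    sims = q @ d.T                       # cosine, since vectors are normalized
    topk = np.sort(sims, axis=1)[:, -k:]
    return float(topk.mean())


# Unlabeled target collection and a sample of queries (no relevance labels).
corpus = ["dense retrieval with transformers", "zero-shot domain shift",
          "sparse lexical matching", "neural ranking models"]
queries = ["which retriever generalises to new domains", "zero-shot search"]

# Candidate dense retrievers (stand-ins for off-the-shelf models).
candidates = {f"retriever_{s}": make_toy_encoder(dim=64, seed=s) for s in range(3)}

scores = {name: unsupervised_retriever_score(enc, queries, corpus)
          for name, enc in candidates.items()}
best = max(scores, key=scores.get)
print(scores, "-> selected:", best)
```

In this sketch the heuristic only measures how tightly each retriever's top results cluster around the queries; as the abstract notes, such unsupervised criteria are not guaranteed to identify the truly most effective retriever.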
Related papers
- A Survey on Data Selection for Language Models [148.300726396877]
Data selection methods aim to determine which data points to include in a training dataset.
Deep learning is mostly driven by empirical evidence, and experimentation on large-scale data is expensive.
Few organizations have the resources for extensive data selection research.
arXiv Detail & Related papers (2024-02-26T18:54:35Z) - Zero-shot Retrieval: Augmenting Pre-trained Models with Search Engines [83.65380507372483]
Large pre-trained models can dramatically reduce the amount of task-specific data required to solve a problem, but they often fail to capture domain-specific nuances out of the box.
This paper shows how to leverage recent advances in NLP and multi-modal learning to augment a pre-trained model with search engine retrieval.
arXiv Detail & Related papers (2023-11-29T05:33:28Z) - XAL: EXplainable Active Learning Makes Classifiers Better Low-resource Learners [71.8257151788923]
We propose a novel Explainable Active Learning framework (XAL) for low-resource text classification.
XAL encourages classifiers to justify their inferences and delve into unlabeled data for which they cannot provide reasonable explanations.
Experiments on six datasets show that XAL achieves consistent improvement over 9 strong baselines.
arXiv Detail & Related papers (2023-10-09T08:07:04Z) - Exploring Data Redundancy in Real-world Image Classification through Data Selection [20.389636181891515]
Deep learning models often require large amounts of data for training, leading to increased costs.
We present two data valuation metrics based on Synaptic Intelligence and gradient norms, respectively, to study redundancy in real-world image data.
Online and offline data selection algorithms are then proposed via clustering and grouping based on the examined data values.
arXiv Detail & Related papers (2023-06-25T03:31:05Z) - infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization.
infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information.
In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
arXiv Detail & Related papers (2023-05-30T18:12:48Z) - Dominant Set-based Active Learning for Text Classification and its Application to Online Social Media [0.0]
We present a novel pool-based active learning method for training on a large unlabeled corpus with minimum annotation cost.
Our proposed method does not have any parameters to be tuned, making it dataset-independent.
Our method achieves higher performance than state-of-the-art active learning strategies.
arXiv Detail & Related papers (2022-01-28T19:19:03Z) - Combining Feature and Instance Attribution to Detect Artifacts [62.63504976810927]
We propose methods to facilitate identification of training data artifacts.
We show that this proposed training-feature attribution approach can be used to uncover artifacts in training data.
We execute a small user study to evaluate whether these methods are useful to NLP researchers in practice.
arXiv Detail & Related papers (2021-07-01T09:26:13Z) - Meta-Learning for Neural Relation Classification with Distant Supervision [38.755055486296435]
We propose a meta-learning based approach, which learns to reweight noisy training data under the guidance of reference data.
Experiments on several datasets demonstrate that the reference data can effectively guide the selection of training data.
arXiv Detail & Related papers (2020-10-26T12:52:28Z) - Adversarial Knowledge Transfer from Unlabeled Data [62.97253639100014]
We present a novel Adversarial Knowledge Transfer framework for transferring knowledge from internet-scale unlabeled data to improve the performance of a classifier.
An important novel aspect of our method is that the unlabeled source data can be of different classes from those of the labeled target data, and there is no need to define a separate pretext task.
arXiv Detail & Related papers (2020-08-13T08:04:27Z) - DEAL: Deep Evidential Active Learning for Image Classification [0.0]
Active Learning (AL) is one approach to mitigate the problem of limited labeled data.
Recent AL methods for CNNs propose different solutions for the selection of instances to be labeled.
We propose a novel AL algorithm that efficiently learns from unlabeled data by capturing high prediction uncertainty.
arXiv Detail & Related papers (2020-07-22T11:14:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.