Unsupervised Domain Adaptation for Sparse Retrieval by Filling
Vocabulary and Word Frequency Gaps
- URL: http://arxiv.org/abs/2211.03988v1
- Date: Tue, 8 Nov 2022 03:58:26 GMT
- Title: Unsupervised Domain Adaptation for Sparse Retrieval by Filling
Vocabulary and Word Frequency Gaps
- Authors: Hiroki Iida and Naoaki Okazaki
- Abstract summary: IR models using a pretrained language model significantly outperform lexical approaches like BM25.
This paper proposes an unsupervised domain adaptation method by filling vocabulary and word-frequency gaps.
We show that our method outperforms the present state-of-the-art domain adaptation method.
- Score: 12.573927420408365
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: IR models using a pretrained language model significantly outperform
lexical approaches like BM25. In particular, SPLADE, which encodes texts to
sparse vectors, is an effective model for practical use because it shows
robustness to out-of-domain datasets. However, SPLADE still struggles with
exact matching of low-frequency words in the training data. In addition, domain
shifts in vocabulary and word frequencies deteriorate the IR performance of
SPLADE. Because supervision data are scarce in the target domain, the domain
shifts must be addressed without supervision. This paper proposes an
unsupervised domain adaptation method that fills vocabulary and word-frequency
gaps. First, we expand the vocabulary and execute continual pretraining with a
masked language model on a corpus of the target domain. Then, we multiply
SPLADE-encoded sparse vectors by inverse document frequency weights to account
for the importance of documents containing low-frequency words. We conducted
experiments with our method on datasets that have a large vocabulary gap from
the source domain. We show that our method outperforms the present
state-of-the-art domain adaptation method. In addition, our method achieves
state-of-the-art results when combined with BM25.
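The two steps above are concrete enough to sketch. The following minimal Python sketch (not the authors' released code) illustrates vocabulary expansion before continual masked-language-model pretraining and IDF re-weighting of SPLADE-style sparse vectors; the base model name, the example terms, the IDF variant, and the helper functions are illustrative assumptions rather than details taken from the paper.

    import math
    from collections import Counter
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    # (1) Vocabulary expansion: add domain-specific terms mined from the target
    # corpus, resize the embedding matrix, then continue MLM pretraining.
    # "bert-base-uncased" and the example terms below are placeholders.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
    domain_terms = ["nephropathy", "tokamak"]  # hypothetical target-domain words
    tokenizer.add_tokens(domain_terms)
    model.resize_token_embeddings(len(tokenizer))
    # ... run continual masked-language-model pretraining on the target corpus ...

    # (2) IDF re-weighting: multiply SPLADE-encoded sparse document vectors by
    # inverse document frequency computed on the target-domain corpus, so that
    # low-frequency (often domain-specific) terms are not under-weighted.
    def idf_table(tokenized_docs):
        """tokenized_docs: list of term-id lists for the target-domain corpus."""
        n_docs = len(tokenized_docs)
        df = Counter()
        for doc in tokenized_docs:
            df.update(set(doc))
        # A standard smoothed IDF; the paper's exact formula may differ.
        return {t: math.log(n_docs / (1 + df_t)) + 1.0 for t, df_t in df.items()}

    def reweight(sparse_vec, idf):
        """sparse_vec: dict {term_id: SPLADE weight} for one encoded document."""
        return {t: w * idf.get(t, 0.0) for t, w in sparse_vec.items()}

Computing the IDF table on the target-domain corpus is what ties the re-weighting to the word-frequency gap described in the abstract; retrieval then remains a sparse dot product between the query vector and the re-weighted document vectors.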
Related papers
- Contextual Biasing to Improve Domain-specific Custom Vocabulary Audio Transcription without Explicit Fine-Tuning of Whisper Model [0.0]
OpenAI's Whisper Automated Speech Recognition model excels in generalizing across diverse datasets and domains.
We propose a method to enhance transcription accuracy without explicit fine-tuning or altering model parameters.
arXiv Detail & Related papers (2024-10-24T01:58:11Z)
- VLAD-VSA: Cross-Domain Face Presentation Attack Detection with Vocabulary Separation and Adaptation [87.9994254822078]
For face presentation attack detection (PAD), most spoofing cues are subtle, local image patterns.
The VLAD aggregation method is adopted to quantize local features with a visual vocabulary that locally partitions the feature space.
The proposed vocabulary separation method divides the vocabulary into domain-shared and domain-specific visual words.
arXiv Detail & Related papers (2022-02-21T15:27:41Z)
- AVocaDo: Strategy for Adapting Vocabulary to Downstream Domain [17.115865763783336]
We propose to consider the vocabulary as an optimizable parameter, allowing us to update the vocabulary by expanding it with domain-specific vocabulary.
We prevent the embeddings of the added words from overfitting to downstream data by utilizing knowledge learned from a pretrained language model through a regularization term.
arXiv Detail & Related papers (2021-10-26T06:26:01Z)
- Contrastive Learning and Self-Training for Unsupervised Domain Adaptation in Semantic Segmentation [71.77083272602525]
Unsupervised domain adaptation (UDA) attempts to provide efficient knowledge transfer from a labeled source domain to an unlabeled target domain.
We propose a contrastive learning approach that adapts category-wise centroids across domains.
We extend our method with self-training, where we use a memory-efficient temporal ensemble to generate consistent and reliable pseudo-labels.
arXiv Detail & Related papers (2021-05-05T11:55:53Z)
- Revisiting Mahalanobis Distance for Transformer-Based Out-of-Domain Detection [60.88952532574564]
This paper conducts a thorough comparison of out-of-domain intent detection methods.
We evaluate multiple contextual encoders and methods, proven to be efficient, on three standard datasets for intent classification.
Our main findings show that fine-tuning Transformer-based encoders on in-domain data leads to superior results.
arXiv Detail & Related papers (2021-01-11T09:10:58Z)
- MASKER: Masked Keyword Regularization for Reliable Text Classification [73.90326322794803]
We propose a fine-tuning method, coined masked keyword regularization (MASKER), that facilitates context-based prediction.
MASKER regularizes the model to reconstruct the keywords from the rest of the words and make low-confidence predictions without enough context.
We demonstrate that MASKER improves OOD detection and cross-domain generalization without degrading classification accuracy.
arXiv Detail & Related papers (2020-12-17T04:54:16Z)
- Accelerating Text Mining Using Domain-Specific Stop Word Lists [57.76576681191192]
We present a novel approach for the automatic extraction of domain-specific words called the hyperplane-based approach.
The hyperplane-based approach can significantly reduce text dimensionality by eliminating irrelevant features.
Results indicate that the hyperplane-based approach can reduce the dimensionality of the corpus by 90% and outperforms mutual information.
arXiv Detail & Related papers (2020-11-18T17:42:32Z)
- CMT in TREC-COVID Round 2: Mitigating the Generalization Gaps from Web to Special Domain Search [89.48123965553098]
This paper presents a search system to alleviate the special-domain adaptation problem.
The system utilizes domain-adaptive pretraining and few-shot learning technologies to help neural rankers mitigate the domain discrepancy.
Our system performs the best among the non-manual runs in Round 2 of the TREC-COVID task.
arXiv Detail & Related papers (2020-11-03T09:10:48Z)
- Coupling Distant Annotation and Adversarial Training for Cross-Domain Chinese Word Segmentation [40.27961925319402]
This paper proposes to couple distant annotation and adversarial training for cross-domain Chinese word segmentation.
For distant annotation, we design an automatic distant annotation mechanism that does not need any supervision or pre-defined dictionaries from the target domain.
For adversarial training, we develop a sentence-level training procedure to perform noise reduction and maximum utilization of the source domain information.
arXiv Detail & Related papers (2020-07-16T08:54:17Z)
- Unsupervised Paraphrasing via Deep Reinforcement Learning [33.00732998036464]
Progressive Unsupervised Paraphrasing (PUP) is an unsupervised paraphrase generation method based on deep reinforcement learning (DRL).
PUP uses a variational autoencoder to generate a seed paraphrase that warm-starts the DRL model.
Then, PUP progressively tunes the seed paraphrase guided by our novel reward function which combines semantic adequacy, language fluency, and expression diversity measures.
arXiv Detail & Related papers (2020-07-05T05:54:02Z)
- Vocabulary Adaptation for Distant Domain Adaptation in Neural Machine Translation [14.390932594872233]
Domain adaptation between distant domains cannot be performed effectively due to mismatches in vocabulary.
We propose vocabulary adaptation, a simple method for effective fine-tuning.
Our method improves the performance of conventional fine-tuning by 3.86 and 3.28 BLEU points in En-Ja and De-En translation, respectively.
arXiv Detail & Related papers (2020-04-30T14:27:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.