Unsupervised Domain Clusters in Pretrained Language Models
- URL: http://arxiv.org/abs/2004.02105v2
- Date: Fri, 1 May 2020 16:49:27 GMT
- Title: Unsupervised Domain Clusters in Pretrained Language Models
- Authors: Roee Aharoni and Yoav Goldberg
- Abstract summary: We show that massive pre-trained language models implicitly learn sentence representations that cluster by domains without supervision.
We propose domain data selection methods based on such models.
We evaluate our data selection methods for neural machine translation across five diverse domains.
- Score: 61.832234606157286
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The notion of "in-domain data" in NLP is often over-simplistic and vague, as
textual data varies in many nuanced linguistic aspects such as topic, style or
level of formality. In addition, domain labels are many times unavailable,
making it challenging to build domain-specific systems. We show that massive
pre-trained language models implicitly learn sentence representations that
cluster by domains without supervision -- suggesting a simple data-driven
definition of domains in textual data. We harness this property and propose
domain data selection methods based on such models, which require only a small
set of in-domain monolingual data. We evaluate our data selection methods for
neural machine translation across five diverse domains, where they outperform
an established approach as measured by both BLEU and by precision and recall of
sentence selection with respect to an oracle.
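The abstract describes a two-step recipe: embed sentences with a pretrained language model and cluster the embeddings to recover domains without supervision, then select training data for a target domain by ranking candidate sentences against a small in-domain seed set. The sketch below illustrates that recipe in Python; the encoder choice (bert-base-uncased), masked mean pooling, the Gaussian-mixture settings, and cosine-to-centroid ranking are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch: unsupervised domain clustering over pretrained-LM sentence
# embeddings, plus data selection against a small in-domain seed set.
# Model name, pooling, and clustering settings are assumptions for illustration.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.mixture import GaussianMixture

MODEL_NAME = "bert-base-uncased"  # any pretrained encoder could be substituted
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(sentences, batch_size=32):
    """Masked mean-pooling of the last hidden layer: one vector per sentence."""
    vecs = []
    for i in range(0, len(sentences), batch_size):
        batch = tokenizer(sentences[i:i + batch_size], padding=True,
                          truncation=True, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**batch).last_hidden_state      # (B, T, H)
        mask = batch["attention_mask"].unsqueeze(-1).float()  # (B, T, 1)
        pooled = (hidden * mask).sum(1) / mask.sum(1)         # (B, H)
        vecs.append(pooled.cpu().numpy())
    return np.concatenate(vecs)

# 1) Unsupervised domain clustering: fit a Gaussian mixture with as many
#    components as presumed domains (five in the paper's NMT setting).
corpus = ["..."]  # mixed-domain candidate sentences (placeholder)
X = embed(corpus)
gmm = GaussianMixture(n_components=5, covariance_type="full", random_state=0)
domain_ids = gmm.fit_predict(X)

# 2) Data selection: score each candidate by cosine similarity to the centroid
#    of a small in-domain monolingual seed set, then keep the top-k.
seed = ["..."]    # small in-domain seed set (placeholder)
centroid = embed(seed).mean(axis=0)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b) + 1e-8)

scores = cosine(X, centroid)
selected = np.argsort(-scores)[:10000]  # pseudo-in-domain training subset
```

In the paper's NMT setting, the selected sentence pairs would then be used to train a domain-specific translation model, with selection quality judged both by downstream BLEU and by precision/recall of sentence selection against an oracle.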
Related papers
- Learning to Generalize Unseen Domains via Multi-Source Meta Learning for Text Classification [71.08024880298613]
We study multi-source domain generalization for text classification.
We propose a framework to use multiple seen domains to train a model that can achieve high accuracy in an unseen domain.
arXiv Detail & Related papers (2024-09-20T07:46:21Z)
- WIDIn: Wording Image for Domain-Invariant Representation in Single-Source Domain Generalization [63.98650220772378]
We present WIDIn, Wording Images for Domain-Invariant representation, to disentangle discriminative visual representations.
We first estimate the language embedding with fine-grained alignment, which can then be used to adaptively identify and remove the domain-specific counterpart.
We show that WIDIn can be applied to both pretrained vision-language models like CLIP, and separately trained uni-modal models like MoCo and BERT.
arXiv Detail & Related papers (2024-05-28T17:46:27Z)
- Domain Private Transformers for Multi-Domain Dialog Systems [2.7013801448234367]
This paper proposes domain privacy as a novel way to quantify how likely a conditional language model is to leak across domains.
Experiments on membership inference attacks show that our proposed method has comparable resiliency to methods adapted from recent literature on differentially private language models.
arXiv Detail & Related papers (2023-05-23T16:27:12Z)
- Label-Free Multi-Domain Machine Translation with Stage-wise Training [13.144729358707206]
We propose a label-free multi-domain machine translation model which requires only a few or no domain-annotated data in training and no domain labels in inference.
Our model is composed of three parts: a backbone model, a domain discriminator responsible for distinguishing data from different domains, and a set of experts that transfer the decoded features from generic to specific.
arXiv Detail & Related papers (2023-05-06T06:30:29Z)
- SwitchPrompt: Learning Domain-Specific Gated Soft Prompts for Classification in Low-Resource Domains [14.096170976149521]
SwitchPrompt is a novel and lightweight prompting methodology for adapting language models trained on general-domain datasets to diverse low-resource domains.
Our few-shot experiments on three text classification benchmarks demonstrate the efficacy of the general-domain pre-trained language models when used with SwitchPrompt.
They often even outperform their domain-specific counterparts trained with baseline state-of-the-art prompting methods, with accuracy gains of up to 10.7%.
arXiv Detail & Related papers (2023-02-14T07:14:08Z)
- Few-Shot Classification in Unseen Domains by Episodic Meta-Learning Across Visual Domains [36.98387822136687]
Few-shot classification aims to carry out classification given only a few labeled examples for the categories of interest.
In this paper, we present a unique learning framework for domain-generalized few-shot classification.
By advancing meta-learning strategies, our learning framework exploits data across multiple source domains to capture domain-invariant features.
arXiv Detail & Related papers (2021-12-27T06:54:11Z)
- Efficient Domain Adaptation of Language Models via Adaptive Tokenization [5.058301279065432]
We show that domain-specific subword sequences can be efficiently determined directly from divergences in the conditional token distributions of the base and domain-specific corpora.
Our approach produces smaller models and requires less training and inference time than other approaches using tokenizer augmentation.
arXiv Detail & Related papers (2021-09-15T17:51:27Z)
- FDMT: A Benchmark Dataset for Fine-grained Domain Adaptation in Machine Translation [53.87731008029645]
We present a real-world fine-grained domain adaptation task in machine translation (FDMT).
The FDMT dataset consists of four sub-domains of information technology: autonomous vehicles, AI education, real-time networks, and smartphones.
We conduct quantitative experiments and in-depth analyses in this new setting, which benchmarks the fine-grained domain adaptation task.
arXiv Detail & Related papers (2020-12-31T17:15:09Z)
- A Review of Single-Source Deep Unsupervised Visual Domain Adaptation [81.07994783143533]
Large-scale labeled training datasets have enabled deep neural networks to excel across a wide range of benchmark vision tasks.
In many applications, it is prohibitively expensive and time-consuming to obtain large quantities of labeled data.
To cope with limited labeled training data, many have attempted to directly apply models trained on a large-scale labeled source domain to another sparsely labeled or unlabeled target domain.
arXiv Detail & Related papers (2020-09-01T00:06:50Z)
- Domain Adaptation for Semantic Parsing [68.81787666086554]
We propose a novel semantic parser for domain adaptation, where we have much fewer annotated data in the target domain compared to the source domain.
Our semantic parser benefits from a two-stage coarse-to-fine framework and can thus provide different and accurate treatments for the two stages.
Experiments on a benchmark dataset show that our method consistently outperforms several popular domain adaptation strategies.
arXiv Detail & Related papers (2020-06-23T14:47:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.