M2D2: A Massively Multi-domain Language Modeling Dataset
- URL: http://arxiv.org/abs/2210.07370v1
- Date: Thu, 13 Oct 2022 21:34:52 GMT
- Title: M2D2: A Massively Multi-domain Language Modeling Dataset
- Authors: Machel Reid, Victor Zhong, Suchin Gururangan, Luke Zettlemoyer
- Abstract summary: We present M2D2, a fine-grained, massively multi-domain corpus for studying domain adaptation in language models (LMs).
Using ontologies derived from Wikipedia and ArXiv categories, we organize the domains in each data source into 22 groups.
We show the benefits of adapting the LM along a domain hierarchy; adapting to smaller amounts of fine-grained domain-specific data can lead to larger in-domain performance gains than larger amounts of weakly relevant data.
- Score: 76.13062203588089
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present M2D2, a fine-grained, massively multi-domain corpus for studying
domain adaptation in language models (LMs). M2D2 consists of 8.5B tokens and
spans 145 domains extracted from Wikipedia and Semantic Scholar. Using
ontologies derived from Wikipedia and ArXiv categories, we organize the domains
in each data source into 22 groups. This two-level hierarchy enables the study
of relationships between domains and their effects on in- and out-of-domain
performance after adaptation. We also present a number of insights into the
nature of effective domain adaptation in LMs, as examples of the new types of
studies M2D2 enables. To improve in-domain performance, we show the benefits of
adapting the LM along a domain hierarchy; adapting to smaller amounts of
fine-grained domain-specific data can lead to larger in-domain performance
gains than larger amounts of weakly relevant data. We further demonstrate a
trade-off between in-domain specialization and out-of-domain generalization
within and across ontologies, as well as a strong correlation between
out-of-domain performance and lexical overlap between domains.
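The reported correlation between lexical overlap and out-of-domain performance suggests a cheap diagnostic before committing compute to adaptation. Below is a minimal sketch of such an overlap score; the whitespace tokenization, minimum-count filter, and Jaccard measure are illustrative assumptions, not the paper's methodology.

```python
# Sketch: unigram vocabulary overlap between two domain corpora, as a
# cheap proxy for expected cross-domain transfer. The whitespace
# tokenizer and Jaccard measure are illustrative assumptions.
from collections import Counter

def vocab(texts, min_count=2):
    """Unigram vocabulary of a corpus, dropping rare tokens."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    return {w for w, c in counts.items() if c >= min_count}

def lexical_overlap(domain_a, domain_b):
    """Jaccard similarity between the two domains' vocabularies."""
    va, vb = vocab(domain_a), vocab(domain_b)
    return len(va & vb) / max(1, len(va | vb))

# Toy usage: a higher score between a candidate adaptation corpus and the
# target domain would, per the reported correlation, predict better transfer.
culture = ["the history of the novel", "the novel and the arts"]
physics = ["the mass of the boson", "boson decay and the mass limits"]
print(f"overlap = {lexical_overlap(culture, physics):.3f}")
```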
Related papers
- Dynamic Instance Domain Adaptation [109.53575039217094]
Most studies on unsupervised domain adaptation assume that each domain's training samples come with domain labels.
We develop a dynamic neural network whose adaptive convolutional kernels generate instance-adaptive residuals, adapting domain-agnostic deep features to each individual instance.
Our model, dubbed DIDA-Net, achieves state-of-the-art performance on several commonly used single-source and multi-source UDA datasets.
arXiv Detail & Related papers (2022-03-09T20:05:54Z)
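As a rough illustration of the idea above (kernels predicted per instance, yielding residuals that adapt domain-agnostic features), here is a minimal PyTorch sketch; the depthwise design and kernel generator are assumptions, not DIDA-Net's actual architecture.

```python
# Sketch: instance-adaptive residuals via dynamic convolution kernels.
# Shapes and the kernel-generator design are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class InstanceAdaptiveResidual(nn.Module):
    def __init__(self, channels: int, k: int = 3):
        super().__init__()
        self.channels, self.k = channels, k
        # Predict one depthwise k x k kernel per channel, per instance.
        self.kernel_gen = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(channels, channels * k * k),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feats.shape
        kernels = self.kernel_gen(feats).view(b * c, 1, self.k, self.k)
        # Grouped conv with groups=b*c applies each instance's own kernels.
        residual = F.conv2d(
            feats.reshape(1, b * c, h, w), kernels,
            padding=self.k // 2, groups=b * c,
        ).view(b, c, h, w)
        return feats + residual  # adapt domain-agnostic features per instance

x = torch.randn(4, 64, 16, 16)                # a batch of deep features
print(InstanceAdaptiveResidual(64)(x).shape)  # torch.Size([4, 64, 16, 16])
```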
- Efficient Hierarchical Domain Adaptation for Pretrained Language Models [77.02962815423658]
Generative language models are trained on diverse, general domain corpora.
We introduce a method to scale domain adaptation to many diverse domains using a computationally efficient adapter approach.
arXiv Detail & Related papers (2021-12-16T11:09:29Z)
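A minimal sketch of the general recipe, assuming standard bottleneck adapters stacked coarse-to-fine along a domain hierarchy; the bottleneck width and the composition rule are assumptions, not the paper's implementation.

```python
# Sketch: bottleneck adapters composed along a domain hierarchy
# (group adapter, then leaf-domain adapter). The bottleneck size and
# the stack-composition rule are illustrative assumptions.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Standard bottleneck adapter: down-project, nonlinearity, up-project."""
    def __init__(self, d_model: int, d_bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, d_bottleneck)
        self.up = nn.Linear(d_bottleneck, d_model)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(torch.relu(self.down(h)))  # residual update

class HierarchicalAdapter(nn.Module):
    """Apply the coarse group adapter, then the fine domain adapter."""
    def __init__(self, d_model: int):
        super().__init__()
        self.group = Adapter(d_model)   # e.g. a coarse ontology group
        self.domain = Adapter(d_model)  # e.g. a fine-grained leaf domain

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.domain(self.group(h))

h = torch.randn(2, 128, 768)  # (batch, seq_len, hidden) from a frozen LM layer
print(HierarchicalAdapter(768)(h).shape)
```

Only the adapter parameters would be trained for each domain while the base LM stays frozen, which is what lets this kind of approach scale to many domains.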
- TAL: Two-stream Adaptive Learning for Generalizable Person Re-identification [115.31432027711202]
We argue that both domain-specific and domain-invariant features are crucial for improving the generalization ability of re-id models.
We propose a framework named two-stream adaptive learning (TAL) to simultaneously model these two kinds of information.
Our framework can be applied to both single-source and multi-source domain generalization tasks.
arXiv Detail & Related papers (2021-11-29T01:27:42Z)
- Multi-Level Features Contrastive Networks for Unsupervised Domain Adaptation [6.934905764152813]
Unsupervised domain adaptation aims to train a model from the labeled source domain to make predictions on the unlabeled target domain.
Existing methods tend to align the two domains directly at the domain level, or perform class-level domain alignment based on deep features.
In this paper, we build on the class-level alignment approach.
arXiv Detail & Related papers (2021-09-14T09:23:27Z)
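As one concrete form of class-level alignment, the sketch below pulls together source features and pseudo-labeled target features of the same class with a contrastive objective; the temperature and single-level loss are illustrative assumptions, not the paper's multi-level formulation.

```python
# Sketch: class-level contrastive alignment between source features and
# pseudo-labeled target features. The temperature and simple cosine
# similarities are illustrative assumptions.
import torch
import torch.nn.functional as F

def class_contrastive_loss(src_feats, src_labels, tgt_feats, tgt_pseudo, tau=0.1):
    """Pull same-class source/target features together, push others apart."""
    src = F.normalize(src_feats, dim=1)
    tgt = F.normalize(tgt_feats, dim=1)
    sim = src @ tgt.t() / tau                       # (n_src, n_tgt) similarities
    same_class = src_labels[:, None] == tgt_pseudo[None, :]
    log_p = F.log_softmax(sim, dim=1)
    # Average log-likelihood of same-class target examples per source anchor.
    pos_log_p = (log_p * same_class).sum(1) / same_class.sum(1).clamp(min=1)
    return -(pos_log_p[same_class.any(1)]).mean()

src = torch.randn(8, 256)
tgt = torch.randn(8, 256)
loss = class_contrastive_loss(src, torch.randint(0, 3, (8,)),
                              tgt, torch.randint(0, 3, (8,)))
print(loss.item())
```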
- Domain2Vec: Domain Embedding for Unsupervised Domain Adaptation [56.94873619509414]
Conventional unsupervised domain adaptation studies the knowledge transfer between a limited number of domains.
We propose a novel Domain2Vec model that provides vectorial representations of visual domains based on joint learning of feature disentanglement and Gram matrices.
We demonstrate that our embedding is capable of predicting domain similarities that match our intuition about visual relations between different domains.
arXiv Detail & Related papers (2020-07-17T22:05:09Z)
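The Gram-matrix half of that idea fits in a few lines; pooling features per image and averaging the Gram matrix over a batch are illustrative simplifications of the paper's joint disentanglement-plus-Gram-matrix model.

```python
# Sketch: a Gram-matrix "fingerprint" of a domain's deep features, with
# cosine similarity between fingerprints as a domain-similarity score.
# Batch-averaging the Gram matrix is an illustrative simplification.
import torch
import torch.nn.functional as F

def domain_vector(feats: torch.Tensor) -> torch.Tensor:
    """feats: (n_images, channels) pooled deep features for one domain."""
    gram = feats.t() @ feats / feats.shape[0]   # (channels, channels)
    return gram.flatten()

def domain_similarity(feats_a, feats_b):
    return F.cosine_similarity(domain_vector(feats_a),
                               domain_vector(feats_b), dim=0)

sketches = torch.randn(100, 512)  # stand-ins for two domains' features
photos = torch.randn(100, 512)
print(domain_similarity(sketches, photos).item())
```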
- Domain Adaptation for Semantic Parsing [68.81787666086554]
We propose a novel semantic parser for domain adaptation, where we have much less annotated data in the target domain than in the source domain.
Our parser benefits from a two-stage coarse-to-fine framework, and can thus provide different and accurate treatments for the two stages.
Experiments on a benchmark dataset show that our method consistently outperforms several popular domain adaptation strategies.
arXiv Detail & Related papers (2020-06-23T14:47:41Z)