DEMix Layers: Disentangling Domains for Modular Language Modeling
- URL: http://arxiv.org/abs/2108.05036v1
- Date: Wed, 11 Aug 2021 05:15:33 GMT
- Title: DEMix Layers: Disentangling Domains for Modular Language Modeling
- Authors: Suchin Gururangan, Mike Lewis, Ari Holtzman, Noah A. Smith, Luke
Zettlemoyer
- Abstract summary: We introduce a new domain expert mixture (DEMix) layer that enables conditioning a language model (LM) on the domain of the input text.
A DEMix layer is a collection of expert feedforward networks, each specialized to a domain.
Experiments show that DEMix layers reduce test-time perplexity, increase training efficiency, and enable rapid adaptation with little overhead.
- Score: 92.57761975953453
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce a new domain expert mixture (DEMix) layer that enables
conditioning a language model (LM) on the domain of the input text. A DEMix
layer is a collection of expert feedforward networks, each specialized to a
domain, that makes the LM modular: experts can be mixed, added or removed after
initial training. Extensive experiments with autoregressive transformer LMs (up
to 1.3B parameters) show that DEMix layers reduce test-time perplexity,
increase training efficiency, and enable rapid adaptation with little overhead.
We show that mixing experts during inference, using a parameter-free weighted
ensemble, allows the model to better generalize to heterogeneous or unseen
domains. We also show that experts can be added to iteratively incorporate new
domains without forgetting older ones, and that experts can be removed to
restrict access to unwanted domains, without additional training. Overall,
these results demonstrate benefits of explicitly conditioning on textual
domains during language modeling.
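For readers who want a concrete picture of the mechanism, the following is a minimal PyTorch sketch of a DEMix-style layer, not the authors' released implementation. It assumes one feedforward expert per training domain, routing by the batch's domain label during training, and a parameter-free blend of expert outputs at inference; the module and helper names, tensor shapes, and the loss-based weight estimate are illustrative assumptions, and the paper's exact routing and ensembling details may differ.

```python
# Minimal sketch of a DEMix-style layer (illustrative only, not the paper's code).
# Assumptions: d_model/d_ff sizes, mixing expert *outputs* with caller-supplied
# weights, and the loss-based weight estimate are not specified in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpertFFN(nn.Module):
    """One domain expert: a standard transformer feedforward block."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class DEMixLayer(nn.Module):
    """A collection of expert FFNs, one per training domain.

    Storing the experts in a ModuleList keeps the layer modular: experts can be
    appended for new domains or deleted to restrict access to unwanted ones,
    without touching the rest of the model.
    """

    def __init__(self, num_domains: int, d_model: int, d_ff: int):
        super().__init__()
        self.d_model, self.d_ff = d_model, d_ff
        self.experts = nn.ModuleList(
            ExpertFFN(d_model, d_ff) for _ in range(num_domains)
        )

    def forward(self, x: torch.Tensor, domain_id: int) -> torch.Tensor:
        # Training: the batch's domain label selects the expert, so each expert
        # is only ever updated on text from its own domain.
        return self.experts[domain_id](x)

    def mix(self, x: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
        # Inference: blend expert outputs with one weight per domain. The blend
        # itself adds no parameters; `weights` is estimated from the context.
        outputs = torch.stack([expert(x) for expert in self.experts])  # (E, B, T, D)
        return torch.einsum("e,ebtd->btd", weights, outputs)

    def add_expert(self) -> None:
        # Add an expert for a new domain; existing experts are left untouched.
        self.experts.append(ExpertFFN(self.d_model, self.d_ff))

    def remove_expert(self, domain_id: int) -> None:
        # Drop an expert to restrict access to its domain, with no extra training.
        del self.experts[domain_id]


def estimate_domain_weights(losses_per_expert: torch.Tensor) -> torch.Tensor:
    # One simple parameter-free choice (an assumption, not the paper's exact rule):
    # treat each expert's language-modeling loss on a context prefix as a negative
    # log-likelihood and softmax the negated losses into mixture weights.
    return F.softmax(-losses_per_expert, dim=0)
```

In this sketch the mixture is applied to the experts' hidden-state outputs; an ensemble over the experts' output distributions would serve the same purpose. Either way, the weights are computed rather than learned, so removing an expert needs no additional training, and a new expert can be trained on its own domain without disturbing the others.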
Related papers
- Monet: Mixture of Monosemantic Experts for Transformers [33.8311330578753]
We introduce the Mixture of Monosemantic Experts for Transformers (Monet) architecture.
Monet incorporates sparse dictionary learning directly into end-to-end Mixture-of-Experts pretraining.
Our analyses demonstrate mutual exclusivity of knowledge across experts and showcase the parametric knowledge encapsulated within individual experts.
arXiv Detail & Related papers (2024-12-05T13:06:03Z)
- UniMix: Towards Domain Adaptive and Generalizable LiDAR Semantic Segmentation in Adverse Weather [55.95708988160047]
LiDAR semantic segmentation (LSS) is a critical task in autonomous driving.
Prior LSS methods are investigated and evaluated on datasets within the same domain in clear weather.
We propose UniMix, a universal method that enhances the adaptability and generalizability of LSS models.
arXiv Detail & Related papers (2024-04-08T02:02:15Z)
- Role Prompting Guided Domain Adaptation with General Capability Preserve for Large Language Models [55.51408151807268]
When tailored to specific domains, Large Language Models (LLMs) tend to experience catastrophic forgetting.
Crafting a versatile model for multiple domains simultaneously often results in a decline in overall performance.
We present the RolE Prompting Guided Multi-Domain Adaptation (REGA) strategy.
arXiv Detail & Related papers (2024-03-05T08:22:41Z)
- BECoTTA: Input-dependent Online Blending of Experts for Continual Test-time Adaptation [59.1863462632777]
Continual Test Time Adaptation (CTTA) is required to adapt efficiently to continuous unseen domains while retaining previously learned knowledge.
This paper proposes BECoTTA, an input-dependent and efficient modular framework for CTTA.
We validate that our method outperforms existing approaches in multiple CTTA scenarios, including disjoint and gradual domain shifts, while requiring 98% fewer trainable parameters.
arXiv Detail & Related papers (2024-02-13T18:37:53Z)
- Decoupled Training: Return of Frustratingly Easy Multi-Domain Learning [20.17925272562433]
Multi-domain learning aims to train a model with minimal average risk across multiple overlapping but non-identical domains.
We propose Decoupled Training (D-Train) as a frustratingly easy and hyperparameter-free multi-domain learning method.
D-Train is a tri-phase general-to-specific training strategy that first pre-trains on all domains to warm up a root model, then post-trains on each domain by splitting into multi-heads, and finally fine-tunes the heads while fixing the backbone (see the sketch after this list).
arXiv Detail & Related papers (2023-09-19T04:06:41Z)
- Meta-DMoE: Adapting to Domain Shift by Meta-Distillation from Mixture-of-Experts [33.21435044949033]
Most existing methods perform training on multiple source domains using a single model.
We propose a novel framework for unsupervised test-time adaptation, which is formulated as a knowledge distillation process.
arXiv Detail & Related papers (2022-10-08T02:28:10Z)
- META: Mimicking Embedding via oThers' Aggregation for Generalizable Person Re-identification [68.39849081353704]
Domain generalizable (DG) person re-identification (ReID) aims to test across unseen domains without access to the target domain data at training time.
This paper presents a new approach called Mimicking Embedding via oThers' Aggregation (META) for DG ReID.
arXiv Detail & Related papers (2021-12-16T08:06:50Z)
- Generalizable Representation Learning for Mixture Domain Face Anti-Spoofing [53.82826073959756]
Face anti-spoofing approaches based on domain generalization (DG) have drawn growing attention due to their robustness to unseen scenarios.
To overcome the limitations of prior DG methods, we propose domain dynamic adjustment meta-learning (D2AM), which does not require domain labels.
arXiv Detail & Related papers (2021-05-06T06:04:59Z)
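To make the Decoupled Training (D-Train) entry above more concrete, here is a hedged sketch of a tri-phase general-to-specific schedule of that kind. It is an illustration under stated assumptions, not the authors' implementation: the `backbone`, `head_template`, `domain_loaders`, and `train_epoch` names, the optimizer, and the epoch counts are all invented for the example.

```python
# Hedged sketch of a D-Train-style tri-phase schedule (illustrative, not the paper's code).
# `backbone`, `head_template`, `domain_loaders`, and `train_epoch` are assumed helpers.
import copy
import torch


def d_train(backbone, head_template, domain_loaders, train_epoch, epochs=(5, 3, 2)):
    """Tri-phase general-to-specific training.

    Phase 1: pre-train one root model (backbone + single head) on all domains.
    Phase 2: split into one head per domain and post-train on each domain.
    Phase 3: fix the backbone and fine-tune only the domain heads.
    """
    root_head = copy.deepcopy(head_template)

    # Phase 1: warm up a root model on the union of all domains.
    params = list(backbone.parameters()) + list(root_head.parameters())
    opt = torch.optim.SGD(params, lr=1e-2)
    for _ in range(epochs[0]):
        for loader in domain_loaders.values():
            train_epoch(backbone, root_head, loader, opt)

    # Phase 2: split into multi-heads, one per domain, initialized from the root head.
    heads = {d: copy.deepcopy(root_head) for d in domain_loaders}
    for domain, loader in domain_loaders.items():
        params = list(backbone.parameters()) + list(heads[domain].parameters())
        opt = torch.optim.SGD(params, lr=1e-3)
        for _ in range(epochs[1]):
            train_epoch(backbone, heads[domain], loader, opt)

    # Phase 3: freeze the backbone and fine-tune each head on its own domain.
    for p in backbone.parameters():
        p.requires_grad_(False)
    for domain, loader in domain_loaders.items():
        opt = torch.optim.SGD(heads[domain].parameters(), lr=1e-3)
        for _ in range(epochs[2]):
            train_epoch(backbone, heads[domain], loader, opt)

    return backbone, heads
```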
This list is automatically generated from the titles and abstracts of the papers on this site.