Lifelong Language Pretraining with Distribution-Specialized Experts
- URL: http://arxiv.org/abs/2305.12281v1
- Date: Sat, 20 May 2023 21:15:19 GMT
- Title: Lifelong Language Pretraining with Distribution-Specialized Experts
- Authors: Wuyang Chen, Yanqi Zhou, Nan Du, Yanping Huang, James Laudon, Zhifeng Chen, Claire Cui
- Abstract summary: Lifelong learning aims to enable information systems to learn from a continuous data stream across time.
We propose Lifelong-MoE, an MoE architecture that dynamically adds model capacity via adding experts with regularized pretraining.
Compared to existing lifelong learning approaches, Lifelong-MoE achieves better few-shot performance on 19 downstream NLP tasks.
- Score: 39.86463645187337
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pretraining on a large-scale corpus has become a standard method to build
general language models (LMs). Adapting a model to new data distributions
targeting different downstream tasks poses significant challenges. Naive
fine-tuning may incur catastrophic forgetting when the over-parameterized LMs
overfit the new data but fail to preserve the pretrained features. Lifelong
learning (LLL) aims to enable information systems to learn from a continuous
data stream across time. However, most prior work modifies the training recipe
assuming a static fixed network architecture. We find that additional model
capacity and proper regularization are key elements to achieving strong LLL
performance. Thus, we propose Lifelong-MoE, an extensible MoE
(Mixture-of-Experts) architecture that dynamically adds model capacity via
adding experts with regularized pretraining. Our results show that by only
introducing a limited number of extra experts while keeping the computation
cost constant, our model can steadily adapt to data distribution shifts while
preserving the previous knowledge. Compared to existing lifelong learning
approaches, Lifelong-MoE achieves better few-shot performance on 19 downstream
NLP tasks.
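As a rough illustration of the mechanism the abstract describes, the snippet below sketches an MoE layer that grows by appending fresh experts for a new data distribution while freezing the old ones; top-1 routing keeps per-token compute roughly constant as the expert pool grows. This is a minimal PyTorch-style sketch under assumed details (the `GrowingMoELayer` and `add_experts` names, layer shapes, and routing are illustrative), not the authors' implementation, and it omits the regularization used during pretraining.

```python
# Minimal sketch (not the authors' code): an MoE layer that grows by adding
# experts for each new data distribution while freezing the old ones.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GrowingMoELayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        self.d_model, self.d_ff = d_model, d_ff
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.router = nn.Linear(d_model, n_experts)

    def add_experts(self, n_new: int):
        """Freeze existing experts, then append fresh experts and widen the router."""
        for p in self.experts.parameters():
            p.requires_grad_(False)              # preserve previously learned experts
        for _ in range(n_new):
            self.experts.append(
                nn.Sequential(nn.Linear(self.d_model, self.d_ff), nn.GELU(),
                              nn.Linear(self.d_ff, self.d_model)))
        old_router = self.router
        self.router = nn.Linear(self.d_model, len(self.experts))
        with torch.no_grad():                    # keep routing logits of old experts unchanged
            self.router.weight[: old_router.out_features] = old_router.weight
            self.router.bias[: old_router.out_features] = old_router.bias

    def forward(self, x):                        # x: (tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)
        top_p, top_i = probs.max(dim=-1)         # top-1 routing: per-token compute stays constant
        out = torch.stack([self.experts[int(i)](t) for t, i in zip(x, top_i)])
        return out * top_p.unsqueeze(-1)
```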
Related papers
- Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters [65.15700861265432]
We present a parameter-efficient continual learning framework to alleviate long-term forgetting in incremental learning with vision-language models.
Our approach involves the dynamic expansion of a pre-trained CLIP model, through the integration of Mixture-of-Experts (MoE) adapters.
To preserve the zero-shot recognition capability of vision-language models, we introduce a Distribution Discriminative Auto-Selector.
arXiv Detail & Related papers (2024-03-18T08:00:23Z)
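As a loose illustration of the entry above, the sketch below routes frozen backbone features either through lightweight per-task adapter "experts" or leaves them untouched for the original zero-shot head, based on a simple distance-to-distribution rule. The class name, the threshold-based selector, and all hyperparameters are assumptions for illustration; the paper's MoE adapters and Distribution Discriminative Auto-Selector are more involved.

```python
# Illustrative sketch only (names and the routing rule are assumptions, not the
# paper's implementation): frozen CLIP features either pass through the original
# zero-shot path or through per-task adapters.
import torch
import torch.nn as nn

class AdapterMoE(nn.Module):
    def __init__(self, dim: int, n_tasks: int, r: int = 16, threshold: float = 1.0):
        super().__init__()
        # one lightweight adapter per incrementally learned task
        self.adapters = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, r), nn.ReLU(), nn.Linear(r, dim))
            for _ in range(n_tasks)
        )
        # mean feature of each task's training distribution (estimated during training)
        self.register_buffer("task_means", torch.zeros(n_tasks, dim))
        self.threshold = threshold

    def forward(self, feats):                            # feats: (batch, dim) frozen features
        dists = torch.cdist(feats, self.task_means)      # (batch, n_tasks)
        min_dist, task_id = dists.min(dim=-1)
        out = feats.clone()
        for b in range(feats.size(0)):
            if min_dist[b] < self.threshold:             # looks in-distribution: use an adapter
                out[b] = feats[b] + self.adapters[int(task_id[b])](feats[b])
            # else: keep the original feature so the zero-shot head handles it
        return out
```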
- Towards Robust Continual Learning with Bayesian Adaptive Moment Regularization [51.34904967046097]
Continual learning seeks to overcome the challenge of catastrophic forgetting, where a model forgets previously learnt information.
We introduce a novel prior-based method that better constrains parameter growth, reducing catastrophic forgetting.
Results show that BAdam achieves state-of-the-art performance for prior-based methods on challenging single-headed class-incremental experiments.
arXiv Detail & Related papers (2023-09-15T17:10:51Z)
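To make "a prior-based method that constrains parameter growth" concrete, here is a generic quadratic prior penalty (EWC-style) that anchors parameters to their values after the previous task, weighted by a per-parameter importance estimate. It illustrates the family of methods the entry above belongs to, not BAdam's specific Bayesian adaptive-moment update; all names are illustrative.

```python
# Generic prior-based regularizer sketch (EWC-style quadratic penalty).
import torch

def prior_penalty(model, prior_means, precisions, strength=1.0):
    """Penalize drift of each parameter from its value after the previous task,
    weighted by an (assumed) per-parameter precision/importance estimate."""
    loss = 0.0
    for name, p in model.named_parameters():
        if name in prior_means:
            loss = loss + (precisions[name] * (p - prior_means[name]) ** 2).sum()
    return strength * loss

# usage sketch: total_loss = task_loss + prior_penalty(model, means, fishers, strength=0.5)
```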
- INGENIOUS: Using Informative Data Subsets for Efficient Pre-Training of Language Models [40.54353850357839]
We show how we can employ submodular optimization to select highly representative subsets of the training corpora.
We show that the resulting models achieve up to $\sim 99\%$ of the performance of the fully-trained models.
arXiv Detail & Related papers (2023-05-11T09:24:41Z)
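One standard way to "select highly representative subsets" with submodular optimization is greedy facility-location maximization, sketched below with NumPy. This shows the generic technique only; INGENIOUS's exact objective and the efficiency tricks needed at corpus scale are not reproduced here.

```python
# Greedy facility-location selection: repeatedly add the item that most improves
# how well the chosen subset "covers" the whole corpus under a similarity matrix.
import numpy as np

def facility_location_greedy(sim: np.ndarray, budget: int) -> list[int]:
    """sim[i, j]: similarity between corpus items i and j (e.g. cosine of embeddings)."""
    n = sim.shape[0]
    selected: list[int] = []
    coverage = np.zeros(n)                  # best similarity of each item to the chosen set
    for _ in range(budget):
        gains = np.maximum(sim, coverage).sum(axis=1) - coverage.sum()
        gains[selected] = -np.inf           # never pick the same item twice
        best = int(np.argmax(gains))
        selected.append(best)
        coverage = np.maximum(coverage, sim[best])
    return selected
```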
- Architecture, Dataset and Model-Scale Agnostic Data-free Meta-Learning [119.70303730341938]
We propose ePisode cUrriculum inveRsion (ECI) during data-free meta training and invErsion calibRation following inner loop (ICFIL) during meta testing.
ECI adaptively increases the difficulty level of pseudo episodes according to the real-time feedback of the meta model.
We formulate the optimization process of meta training with ECI as an adversarial form in an end-to-end manner.
arXiv Detail & Related papers (2023-03-20T15:10:41Z)
- Preventing Zero-Shot Transfer Degradation in Continual Learning of Vision-Language Models [13.340759455910721]
We propose a novel method to prevent zero-shot transfer degradation in the continual learning of vision-language models.
Our method outperforms other methods in the traditional class-incremental learning setting.
arXiv Detail & Related papers (2023-03-12T10:28:07Z)
- Preventing Catastrophic Forgetting in Continual Learning of New Natural Language Tasks [17.879087904904935]
Multi-Task Learning (MTL) is widely-accepted in Natural Language Processing as a standard technique for learning multiple related tasks in one model.
As systems usually evolve over time, adding a new task to an existing MTL model usually requires retraining the model from scratch on all the tasks.
In this paper, we approach the problem of incrementally expanding MTL models' capability to solve new tasks over time by distilling the knowledge of an already trained model on n tasks into a new one for solving n+1 tasks.
arXiv Detail & Related papers (2023-02-22T00:18:25Z)
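The distillation step in the entry above can be pictured as follows: a frozen n-task teacher provides soft targets on the old tasks while task n+1 is learned from its labels. The loss below is a hedged sketch with illustrative names and a standard temperature-scaled KL term, not the paper's exact formulation.

```python
# Distill an n-task teacher into an (n+1)-task student: soft targets for old
# tasks, cross-entropy for the new task.
import torch
import torch.nn.functional as F

def expand_and_distill_loss(student_logits, teacher_logits, new_task_logits,
                            new_task_labels, temperature=2.0, alpha=0.5):
    """student_logits / teacher_logits: dicts mapping each of the n old task names
    to logits on the same batch; new_task_logits/labels: supervision for task n+1."""
    distill = 0.0
    for task in teacher_logits:
        t = F.softmax(teacher_logits[task] / temperature, dim=-1)
        s = F.log_softmax(student_logits[task] / temperature, dim=-1)
        distill = distill + F.kl_div(s, t, reduction="batchmean") * temperature ** 2
    new_task = F.cross_entropy(new_task_logits, new_task_labels)
    return alpha * distill + (1 - alpha) * new_task
```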
- Task Residual for Tuning Vision-Language Models [69.22958802711017]
We propose a new efficient tuning approach for vision-language models (VLMs) named Task Residual Tuning (TaskRes).
TaskRes explicitly decouples the prior knowledge of the pre-trained models and new knowledge regarding a target task.
The proposed TaskRes is simple yet effective, which significantly outperforms previous methods on 11 benchmark datasets.
arXiv Detail & Related papers (2022-11-18T15:09:03Z)
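A minimal sketch of the task-residual idea from the entry above: the classifier is initialized from frozen text embeddings of the class names and only a small additive residual is tuned, which keeps the prior knowledge separate from the task-specific part. Shapes, the scaling factor `alpha`, and the class name are assumptions; see the paper for the actual formulation.

```python
# Task-residual head: frozen class embeddings plus a small learnable residual.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskResidualHead(nn.Module):
    def __init__(self, text_embeddings: torch.Tensor, alpha: float = 0.5):
        super().__init__()
        # (num_classes, dim) class-name embeddings from the pre-trained text encoder, frozen
        self.register_buffer("base", F.normalize(text_embeddings, dim=-1))
        self.residual = nn.Parameter(torch.zeros_like(text_embeddings))  # the only trained part
        self.alpha = alpha

    def forward(self, image_features):                   # (batch, dim)
        weights = F.normalize(self.base + self.alpha * self.residual, dim=-1)
        return F.normalize(image_features, dim=-1) @ weights.t()   # cosine-similarity logits
```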
- ELLE: Efficient Lifelong Pre-training for Emerging Data [91.52652408402815]
Current pre-trained language models (PLMs) are typically trained with static data, ignoring that in real-world scenarios, streaming data of various sources may continuously grow.
We propose ELLE, aiming at efficient lifelong pre-training for emerging data.
ELLE consists of (1) function preserved model expansion, which flexibly expands an existing PLM's width and depth to improve the efficiency of knowledge acquisition; and (2) pre-trained domain prompts, which disentangle the versatile knowledge learned during pre-training and stimulate the proper knowledge for downstream tasks.
arXiv Detail & Related papers (2022-03-12T01:53:53Z)
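The "function preserved model expansion" mentioned above can be illustrated with a Net2Net-style width expansion: hidden units are duplicated and their outgoing weights rescaled so the widened block computes exactly the same function before further training. ELLE's full recipe (depth growth, pre-trained domain prompts) is not shown, and all names here are illustrative.

```python
# Function-preserving width expansion for one MLP block (assumes an element-wise
# activation between fc_in and fc_out, and new_hidden >= old_hidden).
import torch
import torch.nn as nn

def widen_mlp(fc_in: nn.Linear, fc_out: nn.Linear, new_hidden: int):
    """Return a wider (fc_in, fc_out) pair that computes the same function as before."""
    old_hidden = fc_in.out_features
    idx = torch.randint(0, old_hidden, (new_hidden - old_hidden,))   # neurons to duplicate
    mapping = torch.cat([torch.arange(old_hidden), idx])

    new_in = nn.Linear(fc_in.in_features, new_hidden)
    new_out = nn.Linear(new_hidden, fc_out.out_features)
    with torch.no_grad():
        new_in.weight.copy_(fc_in.weight[mapping])
        new_in.bias.copy_(fc_in.bias[mapping])
        # split each duplicated neuron's outgoing weight by how often it now appears
        counts = torch.bincount(mapping, minlength=old_hidden).float()
        new_out.weight.copy_(fc_out.weight[:, mapping] / counts[mapping])
        new_out.bias.copy_(fc_out.bias)
    return new_in, new_out
```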
- Lifelong Pretraining: Continually Adapting Language Models to Emerging Corpora [31.136334214818305]
We study a lifelong language model pretraining challenge where a PTLM is continually updated so as to adapt to emerging data.
Over a domain-incremental research paper stream and a chronologically ordered tweet stream, we incrementally pretrain a PTLM with different continual learning algorithms.
Our experiments show continual learning algorithms improve knowledge preservation, with logit distillation being the most effective approach.
arXiv Detail & Related papers (2021-10-16T09:59:33Z)
- Continual Class Incremental Learning for CT Thoracic Segmentation [36.45569352490318]
Deep learning organ segmentation approaches require large amounts of annotated training data, which is limited in supply due to reasons of confidentiality and the time required for expert manual annotation.
Being able to train models incrementally without having access to previously used data is desirable.
In this setting, a model learns a new task effectively, but loses performance on previously learned tasks.
The Learning without Forgetting (LwF) approach addresses this issue via replaying its own prediction for past tasks during model training.
We show that LwF can successfully retain knowledge on previous segmentations; however, its ability to learn a new class decreases with the …
arXiv Detail & Related papers (2020-08-12T20:08:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences.