Lifelong Pretraining: Continually Adapting Language Models to Emerging
Corpora
- URL: http://arxiv.org/abs/2110.08534v1
- Date: Sat, 16 Oct 2021 09:59:33 GMT
- Title: Lifelong Pretraining: Continually Adapting Language Models to Emerging
Corpora
- Authors: Xisen Jin, Dejiao Zhang, Henghui Zhu, Wei Xiao, Shang-Wen Li, Xiaokai
Wei, Andrew Arnold, Xiang Ren
- Abstract summary: We study a lifelong language model pretraining challenge where a PTLM is continually updated so as to adapt to emerging data.
Over a domain-incremental research paper stream and a chronologically ordered tweet stream, we incrementally pretrain a PTLM with different continual learning algorithms.
Our experiments show continual learning algorithms improve knowledge preservation, with logit distillation being the most effective approach.
- Score: 31.136334214818305
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pretrained language models (PTLMs) are typically learned over a large, static
corpus and further fine-tuned for various downstream tasks. However, when
deployed in the real world, a PTLM-based model must deal with data from a new
domain that deviates from what the PTLM was initially trained on, or newly
emerged data that contains out-of-distribution information. In this paper, we
study a lifelong language model pretraining challenge where a PTLM is
continually updated so as to adapt to emerging data. Over a domain-incremental
research paper stream and a chronologically ordered tweet stream, we
incrementally pretrain a PTLM with different continual learning algorithms, and
keep track of downstream task performance (after fine-tuning) to analyze the
model's ability to acquire new knowledge and preserve previously learned knowledge. Our
experiments show continual learning algorithms improve knowledge preservation,
with logit distillation being the most effective approach. We further show that
continual pretraining improves generalization when the training and testing data of
downstream tasks are drawn from different time steps, but not when they come from
the same time step. We believe our problem formulation, methods,
and analysis will inspire future studies towards continual pretraining of
language models.
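To make the logit-distillation baseline concrete, the following is a minimal sketch (not the authors' released code) of distillation-regularized continual MLM pretraining with PyTorch and HuggingFace Transformers: a frozen copy of the model from the previous time step provides soft targets, and a KL term is added to the standard masked-language-modeling loss on the new corpus. The dataloader name `new_domain_loader` and the `temperature` / `distill_weight` values are illustrative assumptions.
```python
# Minimal sketch of distillation-regularized continual MLM pretraining.
# Assumptions (not from the paper): HuggingFace Transformers, a dataloader
# `new_domain_loader` yielding MLM batches with `labels`, and illustrative
# values for `temperature` and `distill_weight`.
import copy
import torch
import torch.nn.functional as F
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("roberta-base")
teacher = copy.deepcopy(model).eval()              # frozen snapshot from the previous time step
for p in teacher.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
temperature, distill_weight = 2.0, 1.0

for batch in new_domain_loader:                    # batches from the newly emerging corpus
    out = model(input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
                labels=batch["labels"])            # standard MLM loss on the new data
    with torch.no_grad():
        teacher_logits = teacher(input_ids=batch["input_ids"],
                                 attention_mask=batch["attention_mask"]).logits

    # KL divergence between softened student and teacher vocabulary distributions,
    # penalizing drift away from what the previous model predicted.
    kl = F.kl_div(
        F.log_softmax(out.logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    loss = out.loss + distill_weight * kl
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```
Setting `distill_weight` to zero recovers plain continual pretraining, i.e. the unregularized baseline.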
Related papers
- Adaptation Odyssey in LLMs: Why Does Additional Pretraining Sometimes Fail to Improve? [19.34040322172224]
We show that training a model on a text domain could degrade its perplexity on the test portion of the same domain.
Our findings can guide decisions about when to adapt a model versus when to rely on its foundational capabilities.
arXiv Detail & Related papers (2024-10-08T00:37:16Z)
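The degradation reported in the entry above is measured in perplexity. As a hedged sketch of how one might check it (not the authors' evaluation code), compute held-out perplexity for the base and domain-adapted checkpoints on the same texts; `eval_texts` and the adapted-checkpoint path are placeholders.
```python
# Sketch: held-out perplexity of a base vs. a domain-adapted checkpoint.
# `eval_texts` (a list of strings) and the adapted checkpoint path are placeholders.
import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def perplexity(model, texts):
    model.eval()
    total_nll, total_tokens = 0.0, 0
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
            out = model(**enc, labels=enc["input_ids"])   # mean NLL over predicted tokens
            n = enc["input_ids"].size(1) - 1              # labels are shifted by one position
            total_nll += out.loss.item() * n
            total_tokens += n
    return math.exp(total_nll / total_tokens)

base = AutoModelForCausalLM.from_pretrained("gpt2")
adapted = AutoModelForCausalLM.from_pretrained("path/to/domain-adapted-checkpoint")  # placeholder
print("base:", perplexity(base, eval_texts), "adapted:", perplexity(adapted, eval_texts))
```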
- PILOT: A Pre-Trained Model-Based Continual Learning Toolbox [71.63186089279218]
This paper introduces a pre-trained model-based continual learning toolbox known as PILOT.
On the one hand, PILOT implements some state-of-the-art class-incremental learning algorithms based on pre-trained models, such as L2P, DualPrompt, and CODA-Prompt.
On the other hand, PILOT fits typical class-incremental learning algorithms within the context of pre-trained models to evaluate their effectiveness.
arXiv Detail & Related papers (2023-09-13T17:55:11Z)
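For background on the methods PILOT implements (consult the toolbox itself for its actual API), L2P-style approaches keep the pre-trained backbone frozen and learn a small pool of prompts that are selected per input by key matching. Below is a generic sketch of that idea with illustrative sizes; it is not PILOT code.
```python
# Generic sketch of an L2P-style prompt pool (illustrative; not PILOT's API).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptPool(nn.Module):
    def __init__(self, pool_size=10, prompt_len=5, dim=768, top_k=3):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(pool_size, dim))              # one key per prompt
        self.prompts = nn.Parameter(torch.randn(pool_size, prompt_len, dim))
        self.top_k = top_k

    def forward(self, query):
        # query: [batch, dim] feature of the input produced by the frozen backbone.
        scores = F.cosine_similarity(query.unsqueeze(1), self.keys.unsqueeze(0), dim=-1)
        idx = scores.topk(self.top_k, dim=1).indices                        # [batch, top_k]
        selected = self.prompts[idx]                                        # [batch, top_k, prompt_len, dim]
        return selected.flatten(1, 2)                                       # tokens to prepend to the input

# Only the prompt pool (plus a task head) is trained while the backbone stays frozen,
# which is what makes these methods attractive for class-incremental learning.
```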
- Improving Language Plasticity via Pretraining with Active Forgetting [63.36484652568976]
We propose to use an active forgetting mechanism during pretraining, as a simple way of creating PLMs that can quickly adapt to new languages.
Experiments with RoBERTa show that models pretrained with our forgetting mechanism demonstrate faster convergence during language adaptation.
arXiv Detail & Related papers (2023-07-03T17:12:44Z)
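As we understand the forgetting mechanism in the paper above, the token embedding layer is periodically re-initialized during pretraining while the rest of the transformer keeps training, which encourages body weights that adapt quickly to a new vocabulary or language. A minimal sketch under that reading, with an illustrative dataloader and reset interval:
```python
# Minimal sketch of an active-forgetting pretraining loop: the input embeddings
# are re-initialized every `reset_every` updates while the rest of the network
# keeps training. The dataloader and hyperparameters are illustrative.
import torch
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("roberta-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
reset_every = 10_000

for step, batch in enumerate(pretrain_loader):     # placeholder MLM dataloader (with labels)
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    if (step + 1) % reset_every == 0:
        # "Forget" the lexical layer: re-initialize the token embeddings so the
        # transformer body learns features that do not over-commit to one token inventory.
        embeddings = model.get_input_embeddings()
        torch.nn.init.normal_(embeddings.weight, mean=0.0, std=0.02)
```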
- Lifelong Language Pretraining with Distribution-Specialized Experts [39.86463645187337]
Lifelong learning aims to enable information systems to learn from a continuous data stream across time.
We propose Lifelong-MoE, an MoE architecture that dynamically adds model capacity by adding experts with regularized pretraining.
Compared to existing lifelong learning approaches, Lifelong-MoE achieves better few-shot performance on 19 downstream NLP tasks.
arXiv Detail & Related papers (2023-05-20T21:15:19Z)
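One way to picture "adding experts with regularized pretraining" in the entry above (an illustrative sketch, not the Lifelong-MoE implementation): when a new distribution arrives, append fresh experts to an MoE layer, freeze what was already learned, and regularize the new training, for example with distillation against a frozen copy of the pre-expansion model as in the sketch after the main abstract.
```python
# Illustrative sketch (not the Lifelong-MoE implementation): add fresh experts
# to an MoE layer, freeze the existing ones, and continue pretraining on new data.
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    def __init__(self, dim, n_experts):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.gate = nn.Linear(dim, n_experts)

    def add_experts(self, n_new):
        dim = self.gate.in_features
        for p in self.parameters():                       # freeze everything learned so far
            p.requires_grad_(False)
        self.experts.extend(nn.Linear(dim, dim) for _ in range(n_new))
        old_gate = self.gate
        self.gate = nn.Linear(dim, len(self.experts))     # new gate covers all experts
        with torch.no_grad():                             # keep old routing scores as a starting point
            self.gate.weight[: old_gate.out_features].copy_(old_gate.weight)
            self.gate.bias[: old_gate.out_features].copy_(old_gate.bias)

    def forward(self, x):
        weights = torch.softmax(self.gate(x), dim=-1)     # dense routing for simplicity
        outs = torch.stack([e(x) for e in self.experts], dim=-1)
        return (outs * weights.unsqueeze(-2)).sum(-1)
```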
- Can LMs Learn New Entities from Descriptions? Challenges in Propagating Injected Knowledge [72.63368052592004]
We study LMs' abilities to make inferences based on injected facts (or to propagate those facts).
We find that existing methods for updating knowledge show little propagation of injected knowledge.
Yet, prepending entity definitions in an LM's context improves performance across all settings.
arXiv Detail & Related papers (2023-05-02T17:59:46Z)
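The fix that works in the entry above, prepending an entity definition to the context, is easy to illustrate. A hypothetical example follows; the model, entity, and strings are placeholders chosen for illustration only.
```python
# Sketch: querying an LM about an emerging entity with and without its definition
# prepended to the context. The model choice and strings are illustrative.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

definition = "Perseverance is a NASA rover that landed in Jezero Crater on Mars in 2021."
question = "Question: Where did the Perseverance rover land? Answer:"

without_context = generator(question, max_new_tokens=20)[0]["generated_text"]
with_context = generator(definition + " " + question, max_new_tokens=20)[0]["generated_text"]
print(without_context, with_context, sep="\n---\n")
```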
- Improving Temporal Generalization of Pre-trained Language Models with Lexical Semantic Change [28.106524698188675]
Recent research has revealed that neural language models at scale suffer from poor temporal generalization capability.
We propose a simple yet effective lexical-level masking strategy to post-train a converged language model.
arXiv Detail & Related papers (2022-10-31T08:12:41Z)
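The summary above does not spell out the masking strategy; one plausible reading is that tokens whose surface forms are flagged as semantically drifting are masked more often during post-training. The sketch below is a hypothetical rendering of that reading, not the paper's code; the word list and probabilities are invented for illustration.
```python
# Hypothetical sketch of a lexical-level masking strategy: tokens whose surface
# forms are flagged as semantically drifting are masked more often than others.
# `drifting_words` and both probabilities are invented for illustration.
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
drifting_words = {"corona", "stream", "viral"}              # placeholder word list

def mask_batch(texts, base_p=0.15, drift_p=0.5):
    enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    ids = enc["input_ids"].clone()
    labels = ids.clone()
    probs = torch.full(ids.shape, base_p)
    for i, row in enumerate(ids.tolist()):
        for j, tok in enumerate(tokenizer.convert_ids_to_tokens(row)):
            if tok.lstrip("#") in drifting_words:
                probs[i, j] = drift_p                       # mask drifting words more aggressively
    # Special tokens are not excluded here for brevity.
    mask = torch.bernoulli(probs).bool() & enc["attention_mask"].bool()
    labels[~mask] = -100                                    # loss only on masked positions
    ids[mask] = tokenizer.mask_token_id
    return ids, labels                                      # feed to an MLM for post-training
```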
- Continual Pre-Training Mitigates Forgetting in Language and Vision [43.80547864450793]
We show that continually pre-trained models are robust against catastrophic forgetting.
We provide empirical evidence supporting the fact that self-supervised pre-training is more effective in retaining previous knowledge than supervised protocols.
arXiv Detail & Related papers (2022-05-19T07:27:12Z)
- ELLE: Efficient Lifelong Pre-training for Emerging Data [91.52652408402815]
Current pre-trained language models (PLMs) are typically trained on static data, ignoring that in real-world scenarios streaming data from various sources may continuously grow.
We propose ELLE, aiming at efficient lifelong pre-training for emerging data.
ELLE consists of (1) function preserved model expansion, which flexibly expands an existing PLM's width and depth to improve the efficiency of knowledge acquisition; and (2) pre-trained domain prompts, which disentangle the versatile knowledge learned during pre-training and stimulate the proper knowledge for downstream tasks.
arXiv Detail & Related papers (2022-03-12T01:53:53Z)
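ELLE's expansion operators are more involved than this, but the intuition behind "function preserved model expansion" in the entry above can be shown with a Net2Net-style width expansion: copy existing hidden units and split their outgoing weights so the wider network computes the same function (exactly so for elementwise activations). This is a sketch of that general idea, not ELLE's code.
```python
# Sketch of function-preserving width expansion (Net2Net-style): copy existing
# hidden units and split their outgoing weights so the output is unchanged.
import torch
import torch.nn as nn

def widen_mlp(fc1: nn.Linear, fc2: nn.Linear, new_width: int):
    old_width = fc1.out_features
    mapping = torch.cat([torch.arange(old_width),
                         torch.randint(0, old_width, (new_width - old_width,))])
    counts = torch.bincount(mapping, minlength=old_width).float()

    new_fc1 = nn.Linear(fc1.in_features, new_width)
    new_fc2 = nn.Linear(new_width, fc2.out_features)
    with torch.no_grad():
        new_fc1.weight.copy_(fc1.weight[mapping])          # replicate incoming weights
        new_fc1.bias.copy_(fc1.bias[mapping])
        # Divide outgoing weights by the number of copies of each unit, so the
        # summed contribution (and hence the network output) is unchanged.
        new_fc2.weight.copy_(fc2.weight[:, mapping] / counts[mapping])
        new_fc2.bias.copy_(fc2.bias)
    return new_fc1, new_fc2

# Quick check that the function is preserved (elementwise activation assumed).
fc1, fc2 = nn.Linear(16, 32), nn.Linear(32, 8)
w1, w2 = widen_mlp(fc1, fc2, 48)
x = torch.randn(4, 16)
assert torch.allclose(fc2(torch.relu(fc1(x))), w2(torch.relu(w1(x))), atol=1e-5)
```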
"Online" continual learning enables evaluating both information retention and online learning efficacy.
In online continual learning, each incoming small batch of data is first used for testing and then added to the training set, making the problem truly online.
We introduce a new benchmark for online continual visual learning that exhibits large scale and natural distribution shifts.
arXiv Detail & Related papers (2021-08-20T06:17:20Z)
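The test-then-train protocol described in the entry above is simple to write down; the sketch below uses placeholder `model`, `stream`, and `criterion` objects.
```python
# Sketch of the test-then-train ("prequential") protocol for online continual
# learning: each incoming batch is evaluated first, then used for training.
# `model`, `stream`, and `criterion` are placeholders.
import torch

def online_continual_learning(model, stream, criterion, lr=1e-3):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    online_accuracy = []
    for x, y in stream:                       # chronologically ordered mini-batches
        model.eval()
        with torch.no_grad():                 # 1) test on the batch before learning from it
            pred = model(x).argmax(dim=-1)
            online_accuracy.append((pred == y).float().mean().item())

        model.train()                         # 2) then add the batch to the training data
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    return online_accuracy
```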
- AMMUS: A Survey of Transformer-based Pretrained Models in Natural Language Processing [0.0]
Transformer-based pretrained language models (T-PTLMs) have achieved great success in almost every NLP task.
Transformer-based PTLMs learn universal language representations from large volumes of text data using self-supervised learning.
These models provide useful background knowledge for downstream tasks, which avoids training downstream models from scratch.
arXiv Detail & Related papers (2021-08-12T05:32:18Z)
- Pre-training Text Representations as Meta Learning [113.3361289756749]
We introduce a learning algorithm which directly optimizes the model's ability to learn text representations for effective learning of downstream tasks.
We show that there is an intrinsic connection between multi-task pre-training and model-agnostic meta-learning with a sequence of meta-train steps.
arXiv Detail & Related papers (2020-04-12T09:05:47Z)
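The connection drawn in the entry above can be illustrated with a generic first-order meta-learning loop (Reptile-style) over a set of pre-training tasks: an inner loop adapts a copy of the model to a sampled task, and the outer update moves the shared initialization toward the adapted weights. This is an illustration of the multi-task-as-meta-learning view, not the paper's exact procedure; `tasks` is a placeholder sampler.
```python
# Generic first-order meta-learning sketch (Reptile-style) over pre-training
# tasks; illustrative only. `tasks.sample()` is assumed to yield a (loss_fn,
# dataloader) pair for one task.
import copy
import torch

def meta_pretrain(model, tasks, inner_steps=5, inner_lr=1e-3, outer_lr=0.1, rounds=100):
    for _ in range(rounds):
        loss_fn, loader = tasks.sample()                  # placeholder task sampler
        fast = copy.deepcopy(model)                       # inner loop: adapt a copy to the task
        inner_opt = torch.optim.SGD(fast.parameters(), lr=inner_lr)
        for _, batch in zip(range(inner_steps), loader):
            loss = loss_fn(fast, batch)
            loss.backward()
            inner_opt.step()
            inner_opt.zero_grad()

        # Outer loop: move the meta-parameters toward the adapted weights, so the
        # shared initialization becomes easy to fine-tune on any of the tasks.
        with torch.no_grad():
            for p, q in zip(model.parameters(), fast.parameters()):
                p.add_(outer_lr * (q - p))
    return model
```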
This list is automatically generated from the titles and abstracts of the papers on this site.