Lifelong Pretraining: Continually Adapting Language Models to Emerging
Corpora
- URL: http://arxiv.org/abs/2110.08534v1
- Date: Sat, 16 Oct 2021 09:59:33 GMT
- Title: Lifelong Pretraining: Continually Adapting Language Models to Emerging
Corpora
- Authors: Xisen Jin, Dejiao Zhang, Henghui Zhu, Wei Xiao, Shang-Wen Li, Xiaokai
Wei, Andrew Arnold, Xiang Ren
- Abstract summary: We study a lifelong language model pretraining challenge where a PTLM is continually updated so as to adapt to emerging data.
Over a domain-incremental research paper stream and a chronologically ordered tweet stream, we incrementally pretrain a PTLM with different continual learning algorithms.
Our experiments show continual learning algorithms improve knowledge preservation, with logit distillation being the most effective approach.
- Score: 31.136334214818305
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pretrained language models (PTLMs) are typically learned over a large, static
corpus and further fine-tuned for various downstream tasks. However, when
deployed in the real world, a PTLM-based model must deal with data from a new
domain that deviates from what the PTLM was initially trained on, or newly
emerged data that contains out-of-distribution information. In this paper, we
study a lifelong language model pretraining challenge where a PTLM is
continually updated so as to adapt to emerging data. Over a domain-incremental
research paper stream and a chronologically ordered tweet stream, we
incrementally pretrain a PTLM with different continual learning algorithms, and
keep track of downstream task performance (after fine-tuning) to analyze the
model's ability to acquire new knowledge and preserve previously learned knowledge. Our
experiments show continual learning algorithms improve knowledge preservation,
with logit distillation being the most effective approach. We further show that
continual pretraining improves generalization when the training and testing data of
downstream tasks are drawn from different time steps, but not when they come from
the same time step. We believe our problem formulation, methods,
and analysis will inspire future studies towards continual pretraining of
language models.
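To make the logit-distillation baseline concrete, the following is a minimal sketch (not the authors' released code) of distillation-regularized continual MLM pretraining with PyTorch and HuggingFace Transformers: a frozen copy of the model from the previous time step provides soft targets, and a KL term is added to the standard masked-language-modeling loss on the new corpus. The dataloader name `new_domain_loader` and the `temperature` / `distill_weight` values are illustrative assumptions.
```python
# Minimal sketch of distillation-regularized continual MLM pretraining.
# Assumptions (not from the paper): HuggingFace Transformers, a dataloader
# `new_domain_loader` yielding MLM batches with `labels`, and illustrative
# values for `temperature` and `distill_weight`.
import copy
import torch
import torch.nn.functional as F
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("roberta-base")
teacher = copy.deepcopy(model).eval()              # frozen snapshot from the previous time step
for p in teacher.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
temperature, distill_weight = 2.0, 1.0

for batch in new_domain_loader:                    # batches from the newly emerging corpus
    out = model(input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
                labels=batch["labels"])            # standard MLM loss on the new data
    with torch.no_grad():
        teacher_logits = teacher(input_ids=batch["input_ids"],
                                 attention_mask=batch["attention_mask"]).logits

    # KL divergence between softened student and teacher vocabulary distributions,
    # penalizing drift away from what the previous model predicted.
    kl = F.kl_div(
        F.log_softmax(out.logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    loss = out.loss + distill_weight * kl
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```
Setting `distill_weight` to zero recovers plain continual pretraining, i.e. the unregularized baseline.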
Related papers
- Adaptation Odyssey in LLMs: Why Does Additional Pretraining Sometimes Fail to Improve? [19.34040322172224]
We show that training a model on a text domain could degrade its perplexity on the test portion of the same domain.
Our findings can guide decisions about when to adapt a model versus when to rely on its foundational capabilities.
arXiv Detail & Related papers (2024-10-08T00:37:16Z)
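The degradation reported in the entry above is measured in perplexity. As a hedged sketch of how one might check it (not the authors' evaluation code), compute held-out perplexity for the base and domain-adapted checkpoints on the same texts; `eval_texts` and the adapted-checkpoint path are placeholders.
```python
# Sketch: held-out perplexity of a base vs. a domain-adapted checkpoint.
# `eval_texts` (a list of strings) and the adapted checkpoint path are placeholders.
import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def perplexity(model, texts):
    model.eval()
    total_nll, total_tokens = 0.0, 0
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
            out = model(**enc, labels=enc["input_ids"])   # mean NLL over predicted tokens
            n = enc["input_ids"].size(1) - 1              # labels are shifted by one position
            total_nll += out.loss.item() * n
            total_tokens += n
    return math.exp(total_nll / total_tokens)

base = AutoModelForCausalLM.from_pretrained("gpt2")
adapted = AutoModelForCausalLM.from_pretrained("path/to/domain-adapted-checkpoint")  # placeholder
print("base:", perplexity(base, eval_texts), "adapted:", perplexity(adapted, eval_texts))
```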
- PILOT: A Pre-Trained Model-Based Continual Learning Toolbox [71.63186089279218]
This paper introduces a pre-trained model-based continual learning toolbox known as PILOT.
On the one hand, PILOT implements some state-of-the-art class-incremental learning algorithms based on pre-trained models, such as L2P, DualPrompt, and CODA-Prompt.
On the other hand, PILOT fits typical class-incremental learning algorithms within the context of pre-trained models to evaluate their effectiveness.
arXiv Detail & Related papers (2023-09-13T17:55:11Z)
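For background on the methods PILOT implements (consult the toolbox itself for its actual API), L2P-style approaches keep the pre-trained backbone frozen and learn a small pool of prompts that are selected per input by key matching. Below is a generic sketch of that idea with illustrative sizes; it is not PILOT code.
```python
# Generic sketch of an L2P-style prompt pool (illustrative; not PILOT's API).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptPool(nn.Module):
    def __init__(self, pool_size=10, prompt_len=5, dim=768, top_k=3):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(pool_size, dim))              # one key per prompt
        self.prompts = nn.Parameter(torch.randn(pool_size, prompt_len, dim))
        self.top_k = top_k

    def forward(self, query):
        # query: [batch, dim] feature of the input produced by the frozen backbone.
        scores = F.cosine_similarity(query.unsqueeze(1), self.keys.unsqueeze(0), dim=-1)
        idx = scores.topk(self.top_k, dim=1).indices                        # [batch, top_k]
        selected = self.prompts[idx]                                        # [batch, top_k, prompt_len, dim]
        return selected.flatten(1, 2)                                       # tokens to prepend to the input

# Only the prompt pool (plus a task head) is trained while the backbone stays frozen,
# which is what makes these methods attractive for class-incremental learning.
```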
- Improving Language Plasticity via Pretraining with Active Forgetting [63.36484652568976]
We propose to use an active forgetting mechanism during pretraining, as a simple way of creating PLMs that can quickly adapt to new languages.
Experiments with RoBERTa show that models pretrained with our forgetting mechanism demonstrate faster convergence during language adaptation.
arXiv Detail & Related papers (2023-07-03T17:12:44Z)
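As we understand the forgetting mechanism in the paper above, the token embedding layer is periodically re-initialized during pretraining while the rest of the transformer keeps training, which encourages body weights that adapt quickly to a new vocabulary or language. A minimal sketch under that reading, with an illustrative dataloader and reset interval:
```python
# Minimal sketch of an active-forgetting pretraining loop: the input embeddings
# are re-initialized every `reset_every` updates while the rest of the network
# keeps training. The dataloader and hyperparameters are illustrative.
import torch
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("roberta-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
reset_every = 10_000

for step, batch in enumerate(pretrain_loader):     # placeholder MLM dataloader (with labels)
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    if (step + 1) % reset_every == 0:
        # "Forget" the lexical layer: re-initialize the token embeddings so the
        # transformer body learns features that do not over-commit to one token inventory.
        embeddings = model.get_input_embeddings()
        torch.nn.init.normal_(embeddings.weight, mean=0.0, std=0.02)
```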
- Lifelong Language Pretraining with Distribution-Specialized Experts [39.86463645187337]
Lifelong learning aims to enable information systems to learn from a continuous data stream across time.
We propose Lifelong-MoE, an MoE architecture that dynamically adds model capacity by adding experts with regularized pretraining.
Compared to existing lifelong learning approaches, Lifelong-MoE achieves better few-shot performance on 19 downstream NLP tasks.
arXiv Detail & Related papers (2023-05-20T21:15:19Z)
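One way to picture "adding experts with regularized pretraining" in the entry above (an illustrative sketch, not the Lifelong-MoE implementation): when a new distribution arrives, append fresh experts to an MoE layer, freeze what was already learned, and regularize the new training, for example with distillation against a frozen copy of the pre-expansion model as in the sketch after the main abstract.
```python
# Illustrative sketch (not the Lifelong-MoE implementation): add fresh experts
# to an MoE layer, freeze the existing ones, and continue pretraining on new data.
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    def __init__(self, dim, n_experts):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.gate = nn.Linear(dim, n_experts)

    def add_experts(self, n_new):
        dim = self.gate.in_features
        for p in self.parameters():                       # freeze everything learned so far
            p.requires_grad_(False)
        self.experts.extend(nn.Linear(dim, dim) for _ in range(n_new))
        old_gate = self.gate
        self.gate = nn.Linear(dim, len(self.experts))     # new gate covers all experts
        with torch.no_grad():                             # keep old routing scores as a starting point
            self.gate.weight[: old_gate.out_features].copy_(old_gate.weight)
            self.gate.bias[: old_gate.out_features].copy_(old_gate.bias)

    def forward(self, x):
        weights = torch.softmax(self.gate(x), dim=-1)     # dense routing for simplicity
        outs = torch.stack([e(x) for e in self.experts], dim=-1)
        return (outs * weights.unsqueeze(-2)).sum(-1)
```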
- Can LMs Learn New Entities from Descriptions? Challenges in Propagating Injected Knowledge [72.63368052592004]
We study LMs' abilities to make inferences based on injected facts (or to propagate those facts).
We find that existing methods for updating knowledge show little propagation of injected knowledge.
Yet, prepending entity definitions in an LM's context improves performance across all settings.
arXiv Detail & Related papers (2023-05-02T17:59:46Z)
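The fix that works in the entry above, prepending an entity definition to the context, is easy to illustrate. A hypothetical example follows; the model, entity, and strings are placeholders chosen for illustration only.
```python
# Sketch: querying an LM about an emerging entity with and without its definition
# prepended to the context. The model choice and strings are illustrative.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

definition = "Perseverance is a NASA rover that landed in Jezero Crater on Mars in 2021."
question = "Question: Where did the Perseverance rover land? Answer:"

without_context = generator(question, max_new_tokens=20)[0]["generated_text"]
with_context = generator(definition + " " + question, max_new_tokens=20)[0]["generated_text"]
print(without_context, with_context, sep="\n---\n")
```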
- Improving Temporal Generalization of Pre-trained Language Models with Lexical Semantic Change [28.106524698188675]
Recent research has revealed that neural language models at scale suffer from poor temporal generalization capability.
We propose a simple yet effective lexical-level masking strategy to post-train a converged language model.
arXiv Detail & Related papers (2022-10-31T08:12:41Z)
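The summary above does not spell out the masking strategy; one plausible reading is that tokens whose surface forms are flagged as semantically drifting are masked more often during post-training. The sketch below is a hypothetical rendering of that reading, not the paper's code; the word list and probabilities are invented for illustration.
```python
# Hypothetical sketch of a lexical-level masking strategy: tokens whose surface
# forms are flagged as semantically drifting are masked more often than others.
# `drifting_words` and both probabilities are invented for illustration.
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
drifting_words = {"corona", "stream", "viral"}              # placeholder word list

def mask_batch(texts, base_p=0.15, drift_p=0.5):
    enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    ids = enc["input_ids"].clone()
    labels = ids.clone()
    probs = torch.full(ids.shape, base_p)
    for i, row in enumerate(ids.tolist()):
        for j, tok in enumerate(tokenizer.convert_ids_to_tokens(row)):
            if tok.lstrip("#") in drifting_words:
                probs[i, j] = drift_p                       # mask drifting words more aggressively
    # Special tokens are not excluded here for brevity.
    mask = torch.bernoulli(probs).bool() & enc["attention_mask"].bool()
    labels[~mask] = -100                                    # loss only on masked positions
    ids[mask] = tokenizer.mask_token_id
    return ids, labels                                      # feed to an MLM for post-training
```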
- Continual Pre-Training Mitigates Forgetting in Language and Vision [43.80547864450793]
We show that continually pre-trained models are robust against catastrophic forgetting.
We provide empirical evidence supporting the fact that self-supervised pre-training is more effective in retaining previous knowledge than supervised protocols.
arXiv Detail & Related papers (2022-05-19T07:27:12Z)
- ELLE: Efficient Lifelong Pre-training for Emerging Data [91.52652408402815]
Current pre-trained language models (PLMs) are typically trained on static data, ignoring that in real-world scenarios streaming data from various sources may continuously grow.
We propose ELLE, aiming at efficient lifelong pre-training for emerging data.
ELLE consists of (1) function preserved model expansion, which flexibly expands an existing PLM's width and depth to improve the efficiency of knowledge acquisition; and (2) pre-trained domain prompts, which disentangle the versatile knowledge learned during pre-training and stimulate the proper knowledge for downstream tasks.
arXiv Detail & Related papers (2022-03-12T01:53:53Z)
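ELLE's expansion operators are more involved than this, but the intuition behind "function preserved model expansion" in the entry above can be shown with a Net2Net-style width expansion: copy existing hidden units and split their outgoing weights so the wider network computes the same function (exactly so for elementwise activations). This is a sketch of that general idea, not ELLE's code.
```python
# Sketch of function-preserving width expansion (Net2Net-style): copy existing
# hidden units and split their outgoing weights so the output is unchanged.
import torch
import torch.nn as nn

def widen_mlp(fc1: nn.Linear, fc2: nn.Linear, new_width: int):
    old_width = fc1.out_features
    mapping = torch.cat([torch.arange(old_width),
                         torch.randint(0, old_width, (new_width - old_width,))])
    counts = torch.bincount(mapping, minlength=old_width).float()

    new_fc1 = nn.Linear(fc1.in_features, new_width)
    new_fc2 = nn.Linear(new_width, fc2.out_features)
    with torch.no_grad():
        new_fc1.weight.copy_(fc1.weight[mapping])          # replicate incoming weights
        new_fc1.bias.copy_(fc1.bias[mapping])
        # Divide outgoing weights by the number of copies of each unit, so the
        # summed contribution (and hence the network output) is unchanged.
        new_fc2.weight.copy_(fc2.weight[:, mapping] / counts[mapping])
        new_fc2.bias.copy_(fc2.bias)
    return new_fc1, new_fc2

# Quick check that the function is preserved (elementwise activation assumed).
fc1, fc2 = nn.Linear(16, 32), nn.Linear(32, 8)
w1, w2 = widen_mlp(fc1, fc2, 48)
x = torch.randn(4, 16)
assert torch.allclose(fc2(torch.relu(fc1(x))), w2(torch.relu(w1(x))), atol=1e-5)
```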
"Online" continual learning enables evaluating both information retention and online learning efficacy.
In online continual learning, each incoming small batch of data is first used for testing and then added to the training set, making the problem truly online.
We introduce a new benchmark for online continual visual learning that exhibits large scale and natural distribution shifts.
arXiv Detail & Related papers (2021-08-20T06:17:20Z)
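The test-then-train protocol described in the entry above is simple to write down; the sketch below uses placeholder `model`, `stream`, and `criterion` objects.
```python
# Sketch of the test-then-train ("prequential") protocol for online continual
# learning: each incoming batch is evaluated first, then used for training.
# `model`, `stream`, and `criterion` are placeholders.
import torch

def online_continual_learning(model, stream, criterion, lr=1e-3):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    online_accuracy = []
    for x, y in stream:                       # chronologically ordered mini-batches
        model.eval()
        with torch.no_grad():                 # 1) test on the batch before learning from it
            pred = model(x).argmax(dim=-1)
            online_accuracy.append((pred == y).float().mean().item())

        model.train()                         # 2) then add the batch to the training data
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    return online_accuracy
```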
- AMMUS: A Survey of Transformer-based Pretrained Models in Natural Language Processing [0.0]
Transformer-based pretrained language models (T-PTLMs) have achieved great success in almost every NLP task.
Transformer-based PTLMs learn universal language representations from large volumes of text data using self-supervised learning.
These models provide useful background knowledge for downstream tasks, which avoids training downstream models from scratch.
arXiv Detail & Related papers (2021-08-12T05:32:18Z)
- Pre-training Text Representations as Meta Learning [113.3361289756749]
We introduce a learning algorithm which directly optimizes the model's ability to learn text representations for effective learning of downstream tasks.
We show that there is an intrinsic connection between multi-task pre-training and model-agnostic meta-learning with a sequence of meta-train steps.
arXiv Detail & Related papers (2020-04-12T09:05:47Z)
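The connection drawn in the entry above can be illustrated with a generic first-order meta-learning loop (Reptile-style) over a set of pre-training tasks: an inner loop adapts a copy of the model to a sampled task, and the outer update moves the shared initialization toward the adapted weights. This is an illustration of the multi-task-as-meta-learning view, not the paper's exact procedure; `tasks` is a placeholder sampler.
```python
# Generic first-order meta-learning sketch (Reptile-style) over pre-training
# tasks; illustrative only. `tasks.sample()` is assumed to yield a (loss_fn,
# dataloader) pair for one task.
import copy
import torch

def meta_pretrain(model, tasks, inner_steps=5, inner_lr=1e-3, outer_lr=0.1, rounds=100):
    for _ in range(rounds):
        loss_fn, loader = tasks.sample()                  # placeholder task sampler
        fast = copy.deepcopy(model)                       # inner loop: adapt a copy to the task
        inner_opt = torch.optim.SGD(fast.parameters(), lr=inner_lr)
        for _, batch in zip(range(inner_steps), loader):
            loss = loss_fn(fast, batch)
            loss.backward()
            inner_opt.step()
            inner_opt.zero_grad()

        # Outer loop: move the meta-parameters toward the adapted weights, so the
        # shared initialization becomes easy to fine-tune on any of the tasks.
        with torch.no_grad():
            for p, q in zip(model.parameters(), fast.parameters()):
                p.add_(outer_lr * (q - p))
    return model
```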
This list is automatically generated from the titles and abstracts of the papers on this site.