ELLE: Efficient Lifelong Pre-training for Emerging Data
- URL: http://arxiv.org/abs/2203.06311v1
- Date: Sat, 12 Mar 2022 01:53:53 GMT
- Title: ELLE: Efficient Lifelong Pre-training for Emerging Data
- Authors: Yujia Qin, Jiajie Zhang, Yankai Lin, Zhiyuan Liu, Peng Li, Maosong
Sun, Jie Zhou
- Abstract summary: Current pre-trained language models (PLMs) are typically trained with static data, ignoring that in real-world scenarios, streaming data from various sources may continuously grow.
We propose ELLE, aiming at efficient lifelong pre-training for emerging data.
ELLE consists of (1) function preserved model expansion, which flexibly expands an existing PLM's width and depth to improve the efficiency of knowledge acquisition; and (2) pre-trained domain prompts, which disentangle the versatile knowledge learned during pre-training and stimulate the proper knowledge for downstream tasks.
- Score: 91.52652408402815
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current pre-trained language models (PLMs) are typically trained with static
data, ignoring that in real-world scenarios, streaming data of various sources
may continuously grow. This requires PLMs to integrate the information from all
the sources in a lifelong manner. Although this goal could be achieved by
exhaustive pre-training on all the existing data, such a process is known to be
computationally expensive. To this end, we propose ELLE, aiming at efficient
lifelong pre-training for emerging data. Specifically, ELLE consists of (1)
function preserved model expansion, which flexibly expands an existing PLM's
width and depth to improve the efficiency of knowledge acquisition; and (2)
pre-trained domain prompts, which disentangle the versatile knowledge learned
during pre-training and stimulate the proper knowledge for downstream tasks. We
experiment with ELLE on streaming data from 5 domains with BERT and GPT. The results show the superiority of ELLE over various lifelong learning baselines in both pre-training efficiency and downstream performance. The code is publicly available at https://github.com/thunlp/ELLE.
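To make the idea of function preserved model expansion concrete, below is a minimal, hypothetical sketch of a Net2Net-style width expansion for a feed-forward sublayer in PyTorch. It only illustrates the general principle of growing a layer while keeping its input-output function unchanged; the function names and the exact expansion operators are assumptions, not the official implementation in the repository above.

```python
# Hypothetical sketch of function-preserving width expansion, in the spirit of
# ELLE's "function preserved model expansion"; not the official ELLE code.
import torch
import torch.nn as nn


def widen_ffn(fc_in: nn.Linear, fc_out: nn.Linear, new_hidden: int):
    """Widen fc_out(act(fc_in(x))) from fc_in.out_features to new_hidden hidden
    units while preserving the overall function for any element-wise activation."""
    old_hidden = fc_in.out_features
    assert new_hidden >= old_hidden

    # Each new hidden unit copies an existing one; extra units are random copies.
    mapping = torch.cat([
        torch.arange(old_hidden),
        torch.randint(0, old_hidden, (new_hidden - old_hidden,)),
    ])
    counts = torch.bincount(mapping, minlength=old_hidden).float()

    wide_in = nn.Linear(fc_in.in_features, new_hidden)
    wide_out = nn.Linear(new_hidden, fc_out.out_features)
    with torch.no_grad():
        # Incoming weights and biases are copied verbatim for duplicated units.
        wide_in.weight.copy_(fc_in.weight[mapping])
        wide_in.bias.copy_(fc_in.bias[mapping])
        # Outgoing weights are divided by each unit's duplication count so the
        # copies' contributions sum back to the original activation.
        wide_out.weight.copy_(fc_out.weight[:, mapping] / counts[mapping])
        wide_out.bias.copy_(fc_out.bias)
    return wide_in, wide_out


# Sanity check: the widened FFN computes the same function as the original.
fc1, fc2 = nn.Linear(768, 3072), nn.Linear(3072, 768)
wide1, wide2 = widen_ffn(fc1, fc2, new_hidden=4096)
x = torch.randn(4, 768)
assert torch.allclose(fc2(torch.relu(fc1(x))), wide2(torch.relu(wide1(x))), atol=1e-5)
```

ELLE additionally expands depth and uses pre-trained domain prompts to route knowledge to downstream tasks; see the paper and repository above for the actual procedure.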
Related papers
- CTP: Towards Vision-Language Continual Pretraining via Compatible
Momentum Contrast and Topology Preservation [128.00940554196976]
Vision-Language Continual Pretraining (VLCP) has shown impressive results on diverse downstream tasks by offline training on large-scale datasets.
To support the study of VLCP, we first contribute a comprehensive and unified benchmark dataset, P9D.
The data from each industry is treated as an independent task, which supports continual learning and conforms to the real-world long-tail distribution, simulating pretraining on web data.
arXiv Detail & Related papers (2023-08-14T13:53:18Z)
- ALP: Action-Aware Embodied Learning for Perception [60.64801970249279]
We introduce Action-Aware Embodied Learning for Perception (ALP).
ALP incorporates action information into representation learning through a combination of optimizing a reinforcement learning policy and an inverse dynamics prediction objective.
We show that ALP outperforms existing baselines in several downstream perception tasks.
arXiv Detail & Related papers (2023-06-16T21:51:04Z)
- Revisit Few-shot Intent Classification with PLMs: Direct Fine-tuning vs. Continual Pre-training [20.98770732015944]
Few-shot intent detection involves training a deep learning model to classify utterances based on their underlying intents using only a small amount of labeled data.
We show that continual pre-training may not be essential, since the overfitting problem of PLMs on this task may not be as serious as expected.
To maximize the utilization of the limited available data, we propose a context augmentation method and leverage sequential self-distillation to boost performance.
arXiv Detail & Related papers (2023-06-08T15:26:52Z)
- Lifelong Language Pretraining with Distribution-Specialized Experts [39.86463645187337]
Lifelong learning aims to enable information systems to learn from a continuous data stream across time.
We propose Lifelong-MoE, an MoE architecture that dynamically increases model capacity by adding experts with regularized pretraining.
Compared to existing lifelong learning approaches, Lifelong-MoE achieves better few-shot performance on 19 downstream NLP tasks.
arXiv Detail & Related papers (2023-05-20T21:15:19Z)
- On the Usage of Continual Learning for Out-of-Distribution Generalization in Pre-trained Language Models of Code [12.708117108874083]
Pre-trained language models (PLMs) have become a prevalent technique in deep learning for code.
We study two widely used PLM architectures on two downstream tasks, API call and API usage prediction.
To address out-of-distribution generalization issues, we implement five continual learning approaches, including replay-based and regularization-based methods.
arXiv Detail & Related papers (2023-05-06T18:00:21Z)
- Knowledge Distillation as Efficient Pre-training: Faster Convergence, Higher Data-efficiency, and Better Transferability [53.27240222619834]
Knowledge Distillation as Efficient Pre-training aims to efficiently transfer the learned feature representation from pre-trained models to new student models for future downstream tasks.
Our method performs comparably with supervised pre-training counterparts on 3 downstream tasks and 9 downstream datasets, while requiring 10x less data and 5x less pre-training time.
arXiv Detail & Related papers (2022-03-10T06:23:41Z)
- Lifelong Pretraining: Continually Adapting Language Models to Emerging Corpora [31.136334214818305]
We study a lifelong language model pretraining challenge where a PTLM is continually updated so as to adapt to emerging data.
Over a domain-incremental research paper stream and a chronologically ordered tweet stream, we incrementally pretrain a PTLM with different continual learning algorithms.
Our experiments show continual learning algorithms improve knowledge preservation, with logit distillation being the most effective approach.
arXiv Detail & Related papers (2021-10-16T09:59:33Z)
- On the Transferability of Pre-trained Language Models: A Study from Artificial Datasets [74.11825654535895]
Pre-training language models (LMs) on large-scale unlabeled text data makes it much easier for the model to achieve exceptional downstream performance.
We study what specific traits in the pre-training data, other than the semantics, make a pre-trained LM superior to its counterparts trained from scratch on downstream tasks.
arXiv Detail & Related papers (2021-09-08T10:39:57Z)
- Online Continual Learning with Natural Distribution Shifts: An Empirical Study with Visual Data [101.6195176510611]
"Online" continual learning enables evaluating both information retention and online learning efficacy.
In online continual learning, each incoming small batch of data is first used for testing and then added to the training set, making the problem truly online.
We introduce a new benchmark for online continual visual learning that exhibits large scale and natural distribution shifts.
arXiv Detail & Related papers (2021-08-20T06:17:20Z)
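The last entry above describes a test-then-train protocol for online continual learning, where each incoming batch is evaluated before it is used for training. A minimal, hypothetical sketch of that loop in PyTorch (with placeholder model, stream, and loss objects rather than the benchmark's actual API) might look like this:

```python
# Hypothetical sketch of the "test-then-train" online continual learning loop;
# model/stream/loss_fn are placeholders, not the benchmark's actual API.
import torch


def online_continual_loop(model, optimizer, loss_fn, stream):
    """`stream` yields (inputs, targets) batches in their natural temporal order."""
    online_correct, online_total = 0, 0
    for inputs, targets in stream:
        # 1) Test first: the batch is unseen, so this measures online accuracy.
        model.eval()
        with torch.no_grad():
            preds = model(inputs).argmax(dim=-1)
            online_correct += (preds == targets).sum().item()
            online_total += targets.numel()

        # 2) Then train on the same batch (optionally mixed with a replay buffer).
        model.train()
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
    return online_correct / max(online_total, 1)
```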