Can BERT Refrain from Forgetting on Sequential Tasks? A Probing Study
- URL: http://arxiv.org/abs/2303.01081v1
- Date: Thu, 2 Mar 2023 09:03:43 GMT
- Title: Can BERT Refrain from Forgetting on Sequential Tasks? A Probing Study
- Authors: Mingxu Tao, Yansong Feng, Dongyan Zhao
- Abstract summary: We find that pre-trained language models like BERT have a potential ability to learn sequentially, even without any sparse memory replay.
Our experiments reveal that BERT can actually generate high-quality representations for previously learned tasks over the long term, under extremely sparse replay or even no replay.
- Score: 68.75670223005716
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large pre-trained language models help to achieve state-of-the-art results on a
variety of natural language processing (NLP) tasks; nevertheless, they still
suffer from forgetting when incrementally learning a sequence of tasks. To
alleviate this problem, recent works enhance existing models with sparse
experience replay and local adaptation, which yield satisfactory performance.
However, in this paper we find that pre-trained language models like BERT have
a potential ability to learn sequentially, even without any sparse memory
replay. To verify the ability of BERT to maintain old knowledge, we adopt and
re-finetune single-layer probe networks with the parameters of BERT fixed. We
investigate the models on two types of NLP tasks: text classification and
extractive question answering. Our experiments reveal that BERT can actually
generate high-quality representations for previously learned tasks over the long
term, under extremely sparse replay or even no replay. We further introduce a
series of novel methods to interpret the mechanism of forgetting and how memory
rehearsal plays a significant role in task-incremental learning, bridging the
gap between our new discovery and previous studies of catastrophic forgetting.
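To make the probing setup concrete, below is a minimal sketch of how such a probe could be trained. It is not the authors' released code: it assumes a PyTorch/Hugging Face Transformers stack, a hypothetical binary text-classification task, and an illustrative class name (SingleLayerProbe). All BERT parameters are frozen, so only a single linear layer on top of the [CLS] representation is (re-)finetuned, mirroring the "single-layer probe networks with the parameters of BERT fixed" described in the abstract.

```python
# Illustrative sketch only: frozen BERT encoder + single-layer probe,
# re-finetuned on a previously learned (hypothetical) classification task.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer


class SingleLayerProbe(nn.Module):
    """A single linear layer trained on frozen BERT [CLS] representations."""

    def __init__(self, encoder_name="bert-base-uncased", num_labels=2):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(encoder_name)
        self.encoder = AutoModel.from_pretrained(encoder_name)
        for p in self.encoder.parameters():  # BERT stays fixed; only the probe learns
            p.requires_grad = False
        self.probe = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, texts):
        batch = self.tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():  # no gradients flow into the frozen encoder
            hidden = self.encoder(**batch).last_hidden_state[:, 0]  # [CLS] vectors
        return self.probe(hidden)


# Re-finetune a probe for one earlier task on toy, made-up data.
model = SingleLayerProbe(num_labels=2)
optimizer = torch.optim.AdamW(model.probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

texts = ["a warm, engaging film", "tedious and overlong"]
labels = torch.tensor([1, 0])

logits = model(texts)          # shape: (batch, num_labels)
loss = loss_fn(logits, labels)
loss.backward()                # updates only the linear probe
optimizer.step()
```

Under this setup, the probe's accuracy on an earlier task reflects how much usable information about that task survives in the frozen representations after subsequent training.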
Related papers
- Preventing Catastrophic Forgetting in Continual Learning of New Natural Language Tasks [17.879087904904935]
Multi-Task Learning (MTL) is widely accepted in Natural Language Processing as a standard technique for learning multiple related tasks in one model.
As systems usually evolve over time, adding a new task to an existing MTL model typically requires retraining the model from scratch on all the tasks.
In this paper, we approach the problem of incrementally expanding MTL models' capability to solve new tasks over time by distilling the knowledge of an already trained model on n tasks into a new one for solving n+1 tasks.
arXiv Detail & Related papers (2023-02-22T00:18:25Z)
- Prompt Conditioned VAE: Enhancing Generative Replay for Lifelong Learning in Task-Oriented Dialogue [80.05509768165135]
Generative replay methods are widely employed to consolidate past knowledge with generated pseudo samples.
Most existing generative replay methods use only a single task-specific token to control their models.
We propose a novel method, prompt conditioned VAE for lifelong learning, to enhance generative replay by incorporating tasks' statistics.
arXiv Detail & Related papers (2022-10-14T13:12:14Z)
- Recitation-Augmented Language Models [85.30591349383849]
We show that RECITE is a powerful paradigm for knowledge-intensive NLP tasks.
Specifically, by utilizing recitation as the intermediate step, a recite-and-answer scheme can achieve new state-of-the-art performance.
arXiv Detail & Related papers (2022-10-04T00:49:20Z)
- Lifelong Learning of Few-shot Learners across NLP Tasks [45.273018249235705]
We study the challenge of lifelong few-shot learning over a sequence of diverse NLP tasks.
We propose a continual meta-learning approach which learns to generate adapter weights from a few examples.
We demonstrate that our approach preserves model performance over training tasks and leads to positive knowledge transfer when future tasks are learned.
arXiv Detail & Related papers (2021-04-18T10:41:56Z)
- Learning Invariant Representation for Continual Learning [5.979373021392084]
A key challenge in continual learning is catastrophically forgetting previously learned tasks when the agent faces a new one.
We propose a new pseudo-rehearsal-based method, named Learning Invariant Representation for Continual Learning (IRCL).
Disentangling the shared invariant representation helps to learn a sequence of tasks continually, while being more robust to forgetting and having better knowledge transfer.
arXiv Detail & Related papers (2021-01-15T15:12:51Z)
- Parrot: Data-Driven Behavioral Priors for Reinforcement Learning [79.32403825036792]
We propose a method for pre-training behavioral priors that can capture complex input-output relationships observed in successful trials.
We show how this learned prior can be used for rapidly learning new tasks without impeding the RL agent's ability to try out novel behaviors.
arXiv Detail & Related papers (2020-11-19T18:47:40Z)
- GiBERT: Introducing Linguistic Knowledge into BERT through a Lightweight Gated Injection Method [29.352569563032056]
We propose a novel method to explicitly inject linguistic knowledge in the form of word embeddings into a pre-trained BERT.
Our performance improvements on multiple semantic similarity datasets when injecting dependency-based and counter-fitted embeddings indicate that such information is beneficial and currently missing from the original model.
arXiv Detail & Related papers (2020-10-23T17:00:26Z)
- Hierarchical Multitask Learning Approach for BERT [0.36525095710982913]
BERT learns embeddings by solving two tasks: masked language modeling (masked LM) and next sentence prediction (NSP).
We adopt hierarchical multitask learning approaches for BERT pre-training.
Our results show that imposing a task hierarchy in pre-training improves the performance of embeddings.
arXiv Detail & Related papers (2020-10-17T09:23:04Z)
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks [133.93803565077337]
Retrieval-augmented generation (RAG) models combine pre-trained parametric and non-parametric memory for language generation.
We show that RAG models generate more specific, diverse and factual language than a state-of-the-art parametric-only seq2seq baseline.
arXiv Detail & Related papers (2020-05-22T21:34:34Z)
- Exploring Fine-tuning Techniques for Pre-trained Cross-lingual Models via Continual Learning [74.25168207651376]
Fine-tuning pre-trained language models to downstream cross-lingual tasks has shown promising results.
We leverage continual learning to preserve the cross-lingual ability of the pre-trained model when we fine-tune it to downstream tasks.
Our methods achieve better performance than other fine-tuning baselines on the zero-shot cross-lingual part-of-speech tagging and named entity recognition tasks.
arXiv Detail & Related papers (2020-04-29T14:07:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.